feat(transformers): add glm4v_moe model #1447

iugoood wants to merge 1 commit into mindspore-lab:master
Conversation
Summary of Changes

Hello @iugoood, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the mindone/transformers library by integrating the GLM4v-MoE model. This new model is a multimodal transformer that leverages a Mixture-of-Experts architecture, enabling it to process and generate content based on both text and visual inputs. The changes involve adding the complete model architecture, updating auto-configuration mechanisms, and introducing a dedicated test suite to ensure its proper functioning within the MindSpore framework.
Code Review
This PR adds the glm4v_moe model. The changes are extensive, including the model implementation, boilerplate for auto-classes, and a new test file. The model implementation is complex, especially the multimodal parts and the Mixture of Experts (MoE) layers. I've found a few issues related to correctness, performance, and code clarity that should be addressed. Specifically, there's a typo in a class name, a potential dtype mismatch for Flash Attention, an inefficient and likely buggy loop in position index calculation, an inconsistency in token ID usage, and some dead code. The new test file is a good start but could be expanded to cover video inputs to catch some of these issues.
```python
cu_seqlens = mindspore.mint.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
    dim=0,
    # Select dtype based on the following factors:
    # - FA2 requires that cu_seqlens_q must have dtype int32
    # See https://github.com/huggingface/transformers/pull/34852 for more information
    dtype=grid_thw.dtype,
)
```
The comment indicates that Flash Attention 2 requires cu_seqlens to have a dtype of int32. However, the code sets dtype=grid_thw.dtype, which could be int64 and cause issues. It's safer to explicitly set the dtype to mindspore.int32 to ensure compatibility with Flash Attention 2.
```diff
 cu_seqlens = mindspore.mint.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
     dim=0,
     # Select dtype based on the following factors:
     # - FA2 requires that cu_seqlens_q must have dtype int32
     # See https://github.com/huggingface/transformers/pull/34852 for more information
-    dtype=grid_thw.dtype,
+    dtype=mindspore.int32,
 )
```
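To make the int32 requirement concrete, here is a small NumPy sketch of the same computation (NumPy standing in for the MindSpore `mint` ops; the grid values are made up for illustration). Each of the `t` temporal slices contributes `h * w` patch tokens, and the cumulative boundaries are cast explicitly to 32-bit:

```python
import numpy as np

# Hypothetical (t, h, w) grid per image/video; NumPy defaults to int64.
grid_thw = np.array([[1, 4, 4], [2, 2, 2]], dtype=np.int64)

# h * w tokens per temporal slice, repeated t times per sample.
lengths = np.repeat(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0])

# Running boundaries (with a leading zero for illustration), forced to int32
# rather than inheriting int64 from the input grid.
cu_seqlens = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int32)
# cu_seqlens -> [ 0 16 20 24], dtype int32
```

Without the explicit cast, the cumulative sum inherits the input's int64 dtype, which is exactly the mismatch the review flags.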
```python
for t_idx in range(llm_grid_t):
    t_index = (
        mindspore.Tensor(t_idx).view(-1, 1).expand((-1, llm_grid_h * llm_grid_w)).flatten()
    )
    h_index = (
        mindspore.mint.arange(llm_grid_h).view(1, -1, 1).expand((1, -1, llm_grid_w)).flatten()
    )
    w_index = (
        mindspore.mint.arange(llm_grid_w).view(1, 1, -1).expand((1, llm_grid_h, -1)).flatten()
    )
    llm_pos_ids_list.append(mindspore.mint.stack([t_index, h_index, w_index]) + st_idx)
```
The logic for calculating video position indices appears to be inefficient and potentially incorrect. The loop over llm_grid_t (which is video_frame_num) seems to have O(N^2) complexity with respect to the number of video frames, as it recomputes embeddings for previous frames at each step. Additionally, creating a tensor with mindspore.Tensor(t_idx) inside a loop is inefficient.
A more efficient and likely correct implementation would process one frame per video token without a loop, similar to how image position indices are calculated. This would improve performance and avoid redundant computations.
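The loop-free approach the comment describes can be sketched with broadcasting (a NumPy toy, standing in for the MindSpore `mint` ops; the function name and grid sizes are illustrative, not from the PR):

```python
import numpy as np

def video_pos_indices(llm_grid_t, llm_grid_h, llm_grid_w):
    # Build all three index grids in one shot instead of looping over frames:
    # broadcasting expands each axis to shape (t, h, w), then flattening
    # yields one (frame, row, col) triple per token.
    t_index = np.arange(llm_grid_t).reshape(-1, 1, 1)
    h_index = np.arange(llm_grid_h).reshape(1, -1, 1)
    w_index = np.arange(llm_grid_w).reshape(1, 1, -1)
    t_index, h_index, w_index = np.broadcast_arrays(t_index, h_index, w_index)
    # Shape (3, t*h*w): row 0 = frame index, row 1 = row, row 2 = column.
    return np.stack([t_index.ravel(), h_index.ravel(), w_index.ravel()])

pos = video_pos_indices(2, 2, 3)
# pos[0] -> [0 0 0 0 0 0 1 1 1 1 1 1]   (frame index)
# pos[1] -> [0 0 0 1 1 1 0 0 0 1 1 1]   (row index)
# pos[2] -> [0 1 2 0 1 2 0 1 2 0 1 2]   (column index)
```

This computes every frame's indices once, so the cost is linear in the number of video tokens rather than quadratic in the number of frames.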
```python
special_video_mask = inputs_embeds == self.get_input_embeddings()(
    mindspore.Tensor(
        self.config.video_token_id,
        dtype=mindspore.int64,
    )
)
special_video_mask = special_video_mask.all(-1)
```
There's an inconsistency in creating the special_video_mask. When input_ids is None, it uses self.config.video_token_id, but when input_ids is present, it uses self.config.image_token_id. Based on the comment on line 1301 and the logic in get_rope_index, it seems image_token_id is intended for video frames as well. To ensure consistency, image_token_id should be used in both cases.
```diff
 special_video_mask = inputs_embeds == self.get_input_embeddings()(
     mindspore.Tensor(
-        self.config.video_token_id,
+        self.config.image_token_id,
         dtype=mindspore.int64,
     )
 )
 special_video_mask = special_video_mask.all(-1)
```
```python
if expand_size == 1:
    return input_ids, model_kwargs

visual_keys = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw", "second_per_grid_ts"]
```
The key second_per_grid_ts is included in visual_keys but appears to be unused throughout the model's forward pass. This suggests it might be dead code. To improve code clarity and maintainability, it should be removed from this list.
```diff
-visual_keys = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw", "second_per_grid_ts"]
+visual_keys = ["pixel_values", "image_grid_thw", "pixel_values_videos", "video_grid_thw"]
```
```python
elif key == "second_per_grid_ts":
    dict_to_expand[key] = _repeat_interleave_samples(
        dict_to_expand[key], lengths=list(video_nums), repeat_times=expand_size
    )
```
Code Review
This pull request introduces the glm4v_moe model, a significant new feature. The implementation is comprehensive, covering the model architecture, auto-class registration, and initial tests. However, I have identified several critical and high-severity performance issues related to non-vectorized operations, such as loops over batch items or experts, and code that is not compatible with graph compilation. Addressing these points will be crucial for making the model efficient and usable in production environments. I've also noted a minor typo in a class name.
```python
)
image_index, video_index = 0, 0
video_group_index = 0
for i, input_ids in enumerate(total_input_ids):
```
The for loop iterating over total_input_ids processes each item in the batch sequentially. This is highly inefficient and will be a major performance bottleneck, especially for larger batch sizes. The logic within the loop, including tolist() and itertools.groupby, is also not suitable for graph compilation and is slow. This entire function should be vectorized to process the whole batch at once using tensor operations.
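The `tolist()`/`itertools.groupby` pattern in particular can be replaced with tensor ops. A NumPy sketch of the idea (illustrative token-type ids, not the model's actual values): a run starts wherever the value changes, so the change points give every segment's boundaries in one vectorized pass:

```python
import numpy as np

# Hypothetical per-sample token-type ids (0 = text, 1 = image, 2 = video).
token_types = np.array([0, 0, 1, 1, 1, 0, 2, 2, 0])

# groupby-style run detection with tensor ops: boundaries are change points.
change = np.nonzero(np.diff(token_types))[0] + 1
starts = np.concatenate([[0], change])
ends = np.concatenate([change, [token_types.size]])
runs = list(zip(token_types[starts].tolist(), starts.tolist(), ends.tolist()))
# runs -> [(0, 0, 2), (1, 2, 5), (0, 5, 6), (2, 6, 8), (0, 8, 9)]
```

The same boundary tensors can then drive per-segment position arithmetic for the whole batch without Python-level iteration.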
```python
for expert_idx in range(len(self.experts)):
    expert = self.experts[expert_idx]
    mask = expert_mask[expert_idx]
    token_indices, weight_indices = mindspore.mint.where(mask)

    if token_indices.numel() > 0:
        expert_weights = topk_weights[token_indices, weight_indices]
        expert_input = hidden_states[token_indices]
        expert_output = expert(expert_input)
        weighted_output = expert_output * expert_weights.unsqueeze(-1)
        final_hidden_states.index_add_(0, token_indices, weighted_output)
```
The loop over self.experts is a significant performance bottleneck, especially when the number of experts is large. This implementation will execute the expert forward pass sequentially for each expert, which is inefficient. To optimize this, consider vectorizing the expert computation. You could explore techniques used in other MoE implementations within this library (like Mixtral), which often involve grouping tokens by expert and performing batched computations to avoid Python loops.
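The grouping this comment alludes to can be sketched outside MindSpore. The NumPy toy below (expert MLPs stubbed as single weight matrices; all sizes and names hypothetical) sorts the flattened top-k assignments by expert so each expert runs once over a contiguous batch, rather than looping per token:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, top_k = 6, 4, 3, 2

hidden_states = rng.normal(size=(num_tokens, hidden))
# Stand-in "experts": one weight matrix each instead of a full MLP.
experts = [rng.normal(size=(hidden, hidden)) for _ in range(num_experts)]
topk_ids = rng.integers(0, num_experts, size=(num_tokens, top_k))
topk_weights = np.full((num_tokens, top_k), 1.0 / top_k)

# Flatten the (token, slot) assignments and sort them by expert so that
# each expert sees one contiguous block of tokens.
flat_expert = topk_ids.ravel()
flat_token = np.repeat(np.arange(num_tokens), top_k)
flat_weight = topk_weights.ravel()
order = np.argsort(flat_expert, kind="stable")

# Boundaries of each expert's block within the sorted assignment list.
counts = np.bincount(flat_expert, minlength=num_experts)
starts = np.concatenate([[0], np.cumsum(counts)])

out = np.zeros_like(hidden_states)
for e in range(num_experts):  # one batched matmul per expert, no per-token work
    sel = order[starts[e]:starts[e + 1]]
    if sel.size:
        tok = flat_token[sel]
        # np.add.at accumulates correctly even when a token routed to the
        # same expert in more than one top-k slot.
        np.add.at(out, tok, (hidden_states[tok] @ experts[e]) * flat_weight[sel, None])
```

The remaining loop is over experts only; fully fusing it would additionally require stacking the expert weights into a single tensor, which is the optimization the model's own `CALL FOR CONTRIBUTION` docstring asks for.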
```python
samples = mindspore.mint.split(image_grid_thw, list(image_nums))
# compute the sequence length of images for each sample
lengths = [mindspore.mint.prod(sample, dim=1).sum() for sample in samples]
```
The list comprehension used to calculate lengths involves tensor operations inside a loop, which is inefficient and not friendly to graph compilation. This can be vectorized using ops.segment_sum to improve performance significantly.
```diff
-samples = mindspore.mint.split(image_grid_thw, list(image_nums))
-# compute the sequence length of images for each sample
-lengths = [mindspore.mint.prod(sample, dim=1).sum() for sample in samples]
+prods = mindspore.mint.prod(image_grid_thw, dim=1)
+segment_ids = ops.arange(image_nums.shape[0]).repeat_interleave(image_nums.to(ms.int32))
+lengths = ops.segment_sum(prods, segment_ids).asnumpy().tolist()
```
```python
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)


class Glm4vMoeisionMlp(mindspore.nn.Cell):
```
```python
self.norm1 = Glm4vMoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.norm2 = Glm4vMoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.attn = Glm4vMoeVisionAttention(config)
self.mlp = Glm4vMoeisionMlp(config, bias=False)
```
```python
def rot_pos_emb(self, grid_thw):
    pos_ids = []
    for t, h, w in grid_thw:
        t, h, w = int(t), int(h), int(w)
```
The explicit casting of tensor elements to Python integers using int() will break graph compilation (mindspore.jit). The test file for this model currently only runs in PyNative mode, which might be why this was not caught. To make the model compatible with graph mode for better performance, you should use tensor operations instead. For instance, mindspore.mint.arange(h) can accept a 0-d tensor h.
```diff
-t, h, w = int(t), int(h), int(w)
+t, h, w = t.item(), h.item(), w.item()
```
Code Review
This pull request introduces the glm4v_moe model, a multimodal model with a Mixture-of-Experts architecture. The implementation is adapted from the HuggingFace Transformers library for MindSpore. The changes include the model definition, configuration, and integration into the auto-model classes, along with a new test suite.
My review has identified a few issues:
- A critical bug in `_expand_inputs_for_generation` that would lead to a runtime error.
- A performance issue in `get_rope_index` due to redundant computations inside a loop.
- A typo in a class name (`Glm4vMoeisionMlp`).
- A developer-facing comment in the `moe` method that should be removed from production code.
I've provided suggestions to fix these issues. After addressing them, the PR should be in good shape.
```python
dict_to_expand[key] = _repeat_interleave_samples(
    dict_to_expand[key], lengths=list(video_nums), repeat_times=expand_size
)
```
The code for key == 'second_per_grid_ts' attempts to call _repeat_interleave_samples on a list, but this function expects a tensor. This will cause a runtime error. The list should be converted to a tensor before being passed to _repeat_interleave_samples and then converted back to a list.
```diff
-dict_to_expand[key] = _repeat_interleave_samples(
-    dict_to_expand[key], lengths=list(video_nums), repeat_times=expand_size
-)
+tensor = mindspore.Tensor(dict_to_expand[key])
+tensor = _repeat_interleave_samples(
+    tensor, lengths=list(video_nums), repeat_times=expand_size
+)
+dict_to_expand[key] = tensor.tolist()
```
```python
r"""
CALL FOR CONTRIBUTION! I don't have time to optimise this right now, but expert weights need to be fused
to not have to do a loop here (deepseek has 256 experts soooo yeah).
"""
```
Force-pushed from 7b5b8f0 to 096fbf1.
Add:
1. add glm4v_moe model
2. add UT

Notes: not validated against the officially released weights (zai-org/GLM-4.6V) due to their large size (>100B parameters); ZeRO-3 distributed inference could be tried for validation.