feat(transformers): upgrade qwen3 model to v4.57.1 #1492
wcrzlh wants to merge 4 commits into mindspore-lab:master
Conversation
Summary of Changes: This pull request upgrades the Qwen3 model to transformers v4.57.1, focusing on enhancing modularity, reusability, and performance.

Highlights
Code Review
This pull request refactors the Qwen3 model by introducing generic classification and question-answering layers, simplifying the Qwen3Model and Qwen3ForCausalLM construct methods, and updating various internal attributes and method signatures. Specifically, Qwen3DecoderLayer now inherits from GradientCheckpointingLayer, and the Qwen3Attention.construct method's return type hint was identified as incorrect. The past_key_value parameter was consistently renamed to past_key_values across the relevant methods, and several get/set embedding methods and the gradient checkpointing warning logic were removed. A significant change in the test suite disables the graph mode tests, which contradicts the _can_compile_fullgraph = True flag in Qwen3PreTrainedModel and suggests a potential regression or incomplete feature support that needs to be addressed or documented.
```diff
-class Qwen3DecoderLayer(nn.Cell):
+class Qwen3DecoderLayer(GradientCheckpointingLayer):
```
This change makes Qwen3DecoderLayer inherit from GradientCheckpointingLayer. However, the GradientCheckpointingLayer in mindone/transformers/modeling_layers.py has a __call__ method that raises a NotImplementedError when gradient checkpointing is enabled. This means gradient checkpointing will not work for this model, which contradicts the supports_gradient_checkpointing = True flag in Qwen3PreTrainedModel. Please implement the gradient checkpointing logic in GradientCheckpointingLayer or revert this inheritance if it's not supported.
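To illustrate the dispatch the review is asking for, here is a minimal, framework-free sketch of a gradient-checkpointing base layer. The names (`gradient_checkpointing`, `_checkpointed_call`) and the plain-Python structure are assumptions for illustration, not the actual mindone implementation; a real version would recompute activations in the backward pass (e.g. via MindSpore's `Cell.recompute()`) instead of raising `NotImplementedError`.

```python
class GradientCheckpointingLayer:
    """Sketch: route calls through a checkpointed path when enabled."""

    gradient_checkpointing = False

    def construct(self, *args, **kwargs):
        raise NotImplementedError  # subclasses implement the forward pass

    def __call__(self, *args, **kwargs):
        if self.gradient_checkpointing:
            # The reviewed code raises NotImplementedError on this path;
            # a working version dispatches to activation recomputation.
            return self._checkpointed_call(*args, **kwargs)
        return self.construct(*args, **kwargs)

    def _checkpointed_call(self, *args, **kwargs):
        # Stand-in for recomputation; same result, lower peak memory.
        return self.construct(*args, **kwargs)


class ToyLayer(GradientCheckpointingLayer):
    def construct(self, x):
        return x * 2


layer = ToyLayer()
layer.gradient_checkpointing = True
print(layer(3))  # 6
```

Either path must produce identical outputs; only the memory/compute trade-off differs, which is why the flag can be toggled without changing model behavior.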
```diff
 DTYPE_AND_THRESHOLDS = {"fp32": 5e-4, "fp16": 5e-3, "bf16": 5e-2}
-MODES = [0, 1]
+MODES = [1]
```
Disabling the graph mode test (changing MODES = [0, 1] to MODES = [1]) indicates a potential regression or incomplete feature support for MindSpore's graph mode. This contradicts Qwen3PreTrainedModel, which sets _can_compile_fullgraph = True. The model should be fully functional in both pynative and graph modes, so please restore the graph mode test and fix any underlying issues. If there are known limitations, document them and set the _can_compile_fullgraph flag to False.
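As a rough sketch of what keeping the test parametrized over both modes looks like: the helper below is a hypothetical stand-in (the real suite would call ms.set_context(mode=mode) and compare MindSpore outputs against a PyTorch reference), shown only to make the mode/threshold parametrization concrete.

```python
# 0 = GRAPH_MODE, 1 = PYNATIVE_MODE in MindSpore's context constants.
GRAPH_MODE, PYNATIVE_MODE = 0, 1
MODES = [GRAPH_MODE, PYNATIVE_MODE]  # i.e. keep [0, 1] rather than [1]
DTYPE_AND_THRESHOLDS = {"fp32": 5e-4, "fp16": 5e-3, "bf16": 5e-2}


def run_parity_test(mode, dtype, observed_diff):
    # Hypothetical helper: the real test sets the execution mode and
    # measures the output difference; here the difference is precomputed
    # so the sketch stays self-contained.
    return observed_diff < DTYPE_AND_THRESHOLDS[dtype]


results = [run_parity_test(m, "fp32", 1e-5) for m in MODES]
print(results)
```

Running the same assertion under both modes is what catches graph-mode-only breakage that a pynative-only suite silently misses.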
```python
    past_key_values: Optional[Cache] = None,
    cache_position: Optional[ms.Tensor] = None,
    **kwargs: Unpack[FlashAttentionKwargs],
) -> Tuple[Optional[ms.Tensor], Optional[Tuple[ms.Tensor]]]:
```
The return type hint -> Tuple[Optional[ms.Tensor], Optional[Tuple[ms.Tensor]]] is incorrect. The function returns a tuple of (attn_output, attn_weights), where attn_output is a ms.Tensor and attn_weights is an Optional[ms.Tensor]. The type hint should be updated to -> Tuple[ms.Tensor, Optional[ms.Tensor]] to match the implementation.
```diff
-) -> Tuple[Optional[ms.Tensor], Optional[Tuple[ms.Tensor]]]:
+) -> Tuple[ms.Tensor, Optional[ms.Tensor]]:
```
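A minimal sketch of why the suggested hint matches the implementation: the first element of the returned tuple is always a tensor, while the second is only materialized when attention weights are requested. `Tensor` below is a list alias standing in for `ms.Tensor`, and `attention_forward` is an illustrative name, so the sketch stays self-contained.

```python
from typing import List, Optional, Tuple

Tensor = List[float]  # stand-in for ms.Tensor in this sketch


def attention_forward(
    hidden_states: Tensor, output_attentions: bool = False
) -> Tuple[Tensor, Optional[Tensor]]:
    # attn_output is unconditionally a tensor; attn_weights is None unless
    # requested — hence Tuple[ms.Tensor, Optional[ms.Tensor]], not
    # Tuple[Optional[ms.Tensor], Optional[Tuple[ms.Tensor]]].
    attn_weights = [1.0] if output_attentions else None
    attn_output = hidden_states
    return attn_output, attn_weights


out, weights = attention_forward([0.5])
print(out, weights)  # [0.5] None
```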
What does this PR do?
Fixes # (issue)
Adds # (feature)
Before submitting
What's New. Here are the documentation guidelines
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@xxx