Optimize Qwen3 scope1 decode performance #102
ndleslx wants to merge 1 commit into hw-native-sys:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough
Sequence Diagram(s): (omitted)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Code Review
This pull request updates the qwen3_32b_decode_scope1.py example by increasing the K_CHUNK size and refactoring the RMSNorm and projection stages. The changes introduce parallelization and chunked loop optimization, fusing normalization steps directly into the Q, K, and V projection loops to improve memory efficiency and performance by eliminating large intermediate tensors. I have no feedback to provide.
- parallelize RMS partial reduction and Q/K/V output chunk loops
- increase K_CHUNK to 512 and normalize chunks on demand to reduce wall time
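The fusion described in the commit message can be sketched in NumPy: rather than materializing a full normalized hidden-state tensor, each K_CHUNK-wide slice is normalized on demand inside the projection loop and its partial matmul is accumulated. The function names (`rms_inv_scale`, `fused_rmsnorm_proj`) and shapes here are illustrative, not taken from the PR:

```python
import numpy as np

K_CHUNK = 512  # chunk width from the PR; other values are hypothetical

def rms_inv_scale(x, eps=1e-6):
    # Full-row RMS statistic; computed once, reused by every chunk.
    return 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def fused_rmsnorm_proj(x, gamma, w):
    """x: [batch, hidden], gamma: [hidden], w: [hidden, out_dim]."""
    batch, hidden = x.shape
    out = np.zeros((batch, w.shape[1]), dtype=x.dtype)
    inv = rms_inv_scale(x)  # [batch, 1]
    for start in range(0, hidden, K_CHUNK):
        end = start + K_CHUNK
        # Normalize this chunk on demand -- no full intermediate tensor.
        chunk = x[:, start:end] * inv * gamma[start:end]
        # Accumulate the partial projection for this chunk.
        out += chunk @ w[start:end, :]
    return out
```

The same per-chunk body would be issued once each for the Q, K, and V weight slices, which is what eliminates the large normalized intermediate the PR description mentions.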
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/models/qwen3/qwen3_32b_decode_scope1.py (1)
32-49: ⚠️ Potential issue | 🟠 Major
Validate hidden_size against the larger K_CHUNK.
Line 47 now truncates to hidden // 512, while Lines 99-107 and 130-140 still assume a full 512-wide chunk exists. That means any non-default hidden_size that's not a multiple of 512 will silently drop the tail in the compiled path, while golden_qwen3_scope1 still processes it via Lines 234-236. Please fail fast here or add tail handling before this lands.
Proposed guard
```diff
 def build_qwen3_scope1_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     num_kv_heads: int = NUM_KV_HEADS,
     head_dim: int = HEAD_DIM,
 ):
     hidden = hidden_size
     kv_hidden = num_kv_heads * head_dim
+    if hidden % K_CHUNK != 0:
+        raise ValueError(
+            f"hidden_size ({hidden}) must be a multiple of K_CHUNK ({K_CHUNK})"
+        )
     hidden_blocks = hidden // K_CHUNK
     q_out_blocks = hidden // Q_OUT_CHUNK
     kv_out_blocks = kv_hidden // KV_OUT_CHUNK
```
🤖 Prompt for AI Agents
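The silent truncation the reviewer flags falls out of floor division: any remainder of `hidden_size / K_CHUNK` simply never enters the chunk loop. A standalone sketch (the `blocks_covered` helper and the 5000 example are hypothetical; only K_CHUNK = 512 comes from the diff):

```python
K_CHUNK = 512

def blocks_covered(hidden_size: int, chunk: int = K_CHUNK) -> int:
    # Floor division: any remainder is simply never iterated over.
    return hidden_size // chunk

# A multiple of 512 divides evenly into full blocks.
assert blocks_covered(5120) == 10
# A non-multiple such as 5000 yields only 9 blocks, so the final
# 5000 - 9 * 512 = 392 columns are silently dropped by the compiled path.
assert blocks_covered(5000) == 9
assert 5000 - 9 * K_CHUNK == 392
```

This is why the guard raises on `hidden % K_CHUNK != 0` instead of letting the loop bounds quietly shrink the covered width.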
Verify each finding against the current code and only fix it if needed. In `@examples/models/qwen3/qwen3_32b_decode_scope1.py` around lines 32 - 49, The build_qwen3_scope1_program currently computes hidden_blocks = hidden // K_CHUNK which silently drops any hidden_size tail if hidden_size is not a multiple of K_CHUNK; update build_qwen3_scope1_program to either (a) validate and fail fast by checking hidden_size % K_CHUNK == 0 and raise a clear error (e.g., ValueError) referencing K_CHUNK, hidden_size and hidden_blocks, or (b) implement explicit tail handling so the compiled path matches golden_qwen3_scope1 by processing the final partial block (adjust q_out_blocks/kv_out_blocks/MLP handling accordingly). Make the change within build_qwen3_scope1_program and ensure all dependent computed names (hidden_blocks, q_out_blocks, kv_out_blocks, KV_OUT_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK) are updated to reflect the validation or tail case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8e0e54de-506a-431e-b932-046c5a3f13f9
📒 Files selected for processing (1)
examples/models/qwen3/qwen3_32b_decode_scope1.py