Qwen3 Scope3 refactor with performance improvements #90
zhangqi-chen merged 1 commit into hw-native-sys:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Updated Qwen3 32B scope-3 decode kernel: HIDDEN increased 5120 → 8192, output-projection/residual accumulation postponed until after the full per-tile projection, post-attention RMSNorm refactored with an intermediate BF16 cast, and MLP down-projection reworked to accumulate through an intermediate mlp_tile.
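The RMSNorm change described above — normalize in full precision, then cast the result down to BF16 before it feeds the MLP projections — can be illustrated with a small stdlib-only sketch. This is illustrative, not the kernel's `pl` DSL code; BF16 is emulated here by truncating the low 16 bits of the float32 representation, whereas real hardware typically rounds to nearest.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Emulate a float32 -> bfloat16 cast: narrow to float32, then zero the
    low 16 bits (truncation toward zero; hardware usually rounds-to-nearest)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def rmsnorm_bf16(x, weight, eps=1e-6):
    """RMSNorm computed in full precision, with the normalized, weighted
    result cast down to BF16 before downstream matmuls consume it."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [to_bf16(v / rms * w) for v, w in zip(x, weight)]
```

Since bfloat16 keeps 8 mantissa bits, the cast introduces at most about 0.4% relative error per element, which the downstream BF16 matmuls tolerate.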
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 2 passed, 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)
73-73: Consider using an ASCII character for multiplication.

The comment uses a Unicode multiplication sign (×, U+00D7), which may cause issues with text search tools. Consider using ASCII x or * for better searchability.

- # Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK.
+ # Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 73, replace the Unicode multiplication sign in the comment "Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK." with an ASCII character (use either "x" or "*") so text search and tooling won't be disrupted; update the comment to "Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK." so it can still be located by the same phrase.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Line 73: Replace the Unicode multiplication sign in the comment "Stage 1:
Output projection: attn_out × wo, tiled by Q_OUT_CHUNK." with an ASCII character
(use either "x" or "*") so text search and tooling won't be disrupted; update
the comment to something like "Stage 1: Output projection: attn_out x wo, tiled
by Q_OUT_CHUNK." to locate it by the same phrase.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 244499f1-0dc6-46c3-a04b-e9762df4b99e
📒 Files selected for processing (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py
Code Review
This pull request increases the HIDDEN dimension to 8192 and refactors the decoding pipeline in qwen3_32b_decode_scope3.py, including changes to residual addition and the introduction of an mlp_tile for intermediate MLP results. However, the feedback highlights several critical memory constraints: the increased hidden size exceeds hardware buffer limits given the current BATCH_TILE setting, and the use of pl.incore() in multiple stages is likely to cause overflows as the tensors are too large for local memory. It is recommended to revert to pl.auto_incore() and adjust the tiling strategy to stay within hardware limits.
  BATCH = 16
- HIDDEN = 5120
+ HIDDEN = 8192
Increasing HIDDEN to 8192 while maintaining BATCH_TILE = 16 (line 33) will likely cause a memory overflow in the on-chip vector buffer. For instance, resid1_tile (FP32) now requires 16 × 8192 × 4 B = 512 KB, exceeding the 248 KB limit described in the qwen3_tilelet.md documentation. This is consistent with the 507018 runtime error reported in the PR description. Consider reducing BATCH_TILE to 4 to stay within hardware constraints, as seen in the main qwen3_32b_decode.py file.
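A quick back-of-the-envelope check of the numbers in this comment (a sketch only; the 248 KB vector-buffer limit is the figure attributed to qwen3_tilelet.md above, and 4 bytes/element is standard FP32):

```python
# Estimate on-chip footprint of a [BATCH_TILE, dim] tile in kilobytes.
def tile_kb(batch_tile, dim, dtype_bytes):
    return batch_tile * dim * dtype_bytes / 1024

VECTOR_BUFFER_KB = 248  # assumed per-core on-chip limit (qwen3_tilelet.md)

# resid1_tile is [BATCH_TILE, HIDDEN] in FP32 (4 bytes/element).
print(tile_kb(16, 8192, 4))  # 512.0 KB -> overflows the buffer
print(tile_kb(4, 8192, 4))   # 128.0 KB -> fits with the suggested BATCH_TILE=4
```

This matches the review's arithmetic: at BATCH_TILE=16 the single FP32 residual tile alone is more than double the assumed buffer, while BATCH_TILE=4 leaves headroom for the other live tiles.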
  post_norm_tile = pl.create_tensor([BATCH_TILE, HIDDEN_CFG], dtype=pl.BF16)
  down_proj_tile = pl.create_tensor([BATCH_TILE, HIDDEN_CFG], dtype=pl.FP32)
- with pl.auto_incore():
+ with pl.incore():
Switching from pl.auto_incore() to pl.incore() here is risky. pl.incore() asserts that the entire block's data fits in local memory. Given that resid1_tile (512 KB) and post_norm_tile (256 KB) are both active in this scope and exceed the 248 KB hardware limit, this will likely cause a compilation or runtime failure. It is safer to use pl.auto_incore() to allow the compiler to automatically handle tiling and data movement between global and local memory.
Suggested change:
- with pl.incore():
+ with pl.auto_incore():
  post_norm_tile = pl.assemble(post_norm_tile, normed_bf16, [0, k0])

  # Stage 4: Initialize mlp_tile
  mlp_tile = pl.create_tensor([BATCH_TILE, INTER_CFG], dtype=pl.BF16)
The mlp_tile tensor of size [BATCH_TILE, INTER_CFG] (approx. 800 KB for BATCH_TILE=16) is too large to be allocated as a local tensor using pl.create_tensor. This will lead to a memory overflow on the device. Instead of materializing the full intermediate MLP output in local memory, consider a tiling strategy that processes chunks of the intermediate dimension and accumulates them directly into the down-projection output, or ensure the tensor is placed in global memory if the DSL supports it.
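The reviewer's suggested alternative — never materializing the full [BATCH_TILE, INTER] intermediate, and instead accumulating per-chunk partial products straight into the down-projection output — can be sketched in plain NumPy. This is illustrative only: the SwiGLU gate/up/down structure is assumed from the Qwen3 architecture, and the real kernel would express the accumulation with the pl DSL's matmul_acc over DMA-managed tiles.

```python
import numpy as np

def chunked_mlp_down(x, w_gate, w_up, w_down, inter_chunk):
    """SwiGLU MLP with the intermediate dimension processed in chunks:
    each [batch, inter_chunk] activation slice is produced, multiplied by
    the matching rows of w_down, and accumulated into the output, so the
    full [batch, inter] intermediate never exists at once."""
    batch, hidden = x.shape
    inter = w_gate.shape[1]
    out = np.zeros((batch, hidden), dtype=x.dtype)
    for i0 in range(0, inter, inter_chunk):
        sl = slice(i0, i0 + inter_chunk)
        gate = x @ w_gate[:, sl]
        up = x @ w_up[:, sl]
        act = gate / (1.0 + np.exp(-gate)) * up  # SiLU(gate) * up
        out += act @ w_down[sl, :]               # accumulate (matmul_acc style)
    return out

# The chunked result matches the unchunked reference.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))
wg, wu = rng.standard_normal((32, 64)), rng.standard_normal((32, 64))
wd = rng.standard_normal((64, 32))
full_act = (x @ wg) / (1.0 + np.exp(-(x @ wg))) * (x @ wu)
assert np.allclose(chunked_mlp_down(x, wg, wu, wd, 16), full_act @ wd)
```

The peak live state per iteration is only the [batch, inter_chunk] slice plus the output accumulator, which is what makes the approach attractive when the full intermediate exceeds local memory.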
  # Stage 8 & 9: Down projection + final residual writeback.
  for dob in pl.range(HIDDEN_BLOCKS):
      d0 = dob * K_CHUNK
      with pl.incore():
Using pl.incore() inside the dob loop is problematic because it assumes the sliced tensors (mlp_tile, resid1_tile) are resident in local memory. Given their sizes, they cannot be fully resident. If the intention was to slice from global memory, pl.auto_incore() should be used to allow the compiler to manage the necessary DMA transfers and tiling.
Suggested change:
- with pl.incore():
+ with pl.auto_incore():
- rewrite scope3 into explicit stage-by-stage sections with clearer dataflow
- switch projection accumulations to matmul/matmul_acc and streamline residual assembly
- update default shape baseline (HIDDEN=8192) and keep existing chunk constants
- refresh tensor initializers and remove dead local from golden
- expose CLI platform/device/profiling options and backend selection in compile_and_run
- validate on a2a3sim: out matched 131072/131072
0643f2a to f2afe11
🧹 Nitpick comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)
73-73: Consider replacing the Unicode multiplication sign with ASCII x.

Static analysis flagged the ambiguous × (MULTIPLICATION SIGN) character. While harmless, using ASCII x improves searchability and cross-editor compatibility.

📝 Suggested fix

- # Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK.
+ # Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 73, replace the Unicode multiplication sign in the comment "Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK." with an ASCII "x" to improve searchability and editor compatibility; locate the comment near the output projection description referencing symbols attn_out, wo and Q_OUT_CHUNK in qwen3_32b_decode_scope3.py and update it to read "attn_out x wo, tiled by Q_OUT_CHUNK." without changing surrounding code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Line 73: Replace the Unicode multiplication sign in the comment "Stage 1:
Output projection: attn_out × wo, tiled by Q_OUT_CHUNK." with an ASCII "x" to
improve searchability and editor compatibility; locate the comment near the
output projection description referencing symbols attn_out, wo and Q_OUT_CHUNK
in qwen3_32b_decode_scope3.py and update it to read "attn_out x wo" (or
"attn_out x wo, tiled by Q_OUT_CHUNK.") without changing surrounding code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b1bde245-1fab-4e89-ad65-a1934bc51af4
📒 Files selected for processing (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py
Summary
- Refactored examples/models/qwen3/qwen3_32b_decode_scope3.py into a clearer stage-by-stage structure
- Switched projection accumulations to matmul + matmul_acc, reducing redundant intermediate updates
- Kept the existing chunk constants (K_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK, BATCH_TILE)

Performance (profiling on -p a2a3 --enable-profiling)

Comparison baseline: origin/main version of the same file (d91af0f).

- 63044.26 us -> 3284.44 us (about 19.2x faster)
- 33411 -> 1651 tasks (about 20.2x fewer tasks)
- 3.93 us -> 44.58 us per task (higher per task, but far fewer tasks overall)

Validation

- git show d91af0f:examples/models/qwen3/qwen3_32b_decode_scope3.py > /tmp/scope3_base.py
- python /tmp/scope3_base.py -p a2a3 -d 7 --enable-profiling
- python examples/models/qwen3/qwen3_32b_decode_scope3.py -p a2a3 -d 7 --enable-profiling
- out: 131072/131072 matched
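As a quick sanity check, the headline ratios follow directly from the raw profiling numbers reported above:

```python
# Recompute the reported speedup ratios from the raw profiling figures.
total_before_us, total_after_us = 63044.26, 3284.44
tasks_before, tasks_after = 33411, 1651

print(f"{total_before_us / total_after_us:.1f}x faster")   # 19.2x faster
print(f"{tasks_before / tasks_after:.1f}x fewer tasks")    # 20.2x fewer tasks
```

Note that the per-task figures (3.93 us → 44.58 us) come from the profiler's own per-task measurement, not from dividing total time by task count, since tasks can overlap.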