
Qwen3 Scope3 refactor with performance improvements#90

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from high-cloud:codex/tidy-qwen3-scope3
Apr 10, 2026
Conversation

@high-cloud
Contributor

@high-cloud high-cloud commented Apr 10, 2026

Summary

  • refactor examples/models/qwen3/qwen3_32b_decode_scope3.py into clearer stage-by-stage structure
  • optimize projection accumulation paths by switching to matmul + matmul_acc and reducing redundant intermediate updates
  • keep chunk constant set intact (K_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK, BATCH_TILE)
  • clean up the runtime entry point (CLI options and backend selection) and remove minor dead code in the golden path
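The accumulation pattern described in the second bullet can be sketched in NumPy; `matmul_acc` here is a hypothetical stand-in for the DSL's accumulating matmul, since its exact semantics are not shown in this PR:

```python
import numpy as np

def matmul_acc(acc, a, b):
    # Hypothetical stand-in for an accumulating matmul primitive:
    # acc += a @ b, reusing the accumulator instead of allocating
    # a fresh intermediate per chunk.
    acc += a @ b
    return acc

rng = np.random.default_rng(0)
BATCH_TILE, K_CHUNK, HIDDEN = 4, 8, 16
num_chunks = 3

x_chunks = [rng.standard_normal((BATCH_TILE, K_CHUNK)) for _ in range(num_chunks)]
w_chunks = [rng.standard_normal((K_CHUNK, HIDDEN)) for _ in range(num_chunks)]

# First chunk: a plain matmul initializes the tile; remaining chunks accumulate.
acc = x_chunks[0] @ w_chunks[0]
for xc, wc in zip(x_chunks[1:], w_chunks[1:]):
    acc = matmul_acc(acc, xc, wc)

# Reference: concatenate all chunks and do one big matmul.
ref = np.concatenate(x_chunks, axis=1) @ np.concatenate(w_chunks, axis=0)
assert np.allclose(acc, ref)
```

The point of the refactor is that only the first chunk materializes a fresh tile; every later chunk folds into the existing accumulator, avoiding redundant intermediate updates.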

Performance (profiled with -p a2a3 --enable-profiling)

Comparison baseline: origin/main version of the same file (d91af0f).

  • Total Test Time: 63044.26 us -> 3284.44 us (about 19.2x faster)
  • Wall-clock: 63044.3 us -> 3284.4 us
  • Total tasks: 33411 -> 1651 (about 20.2x fewer tasks)
  • Per-task latency: 3.93 us -> 44.58 us (higher per task, but far fewer tasks overall)
  • End-to-end outcome: substantial runtime reduction from task graph simplification
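The headline ratios above can be checked directly from the raw numbers (the per-task latency figures come from the profiler and are not simply total time divided by task count):

```python
# Sanity-check the speedup and task-count ratios quoted above.
base_us, new_us = 63044.26, 3284.44
base_tasks, new_tasks = 33411, 1651

speedup = base_us / new_us          # total-time ratio
task_reduction = base_tasks / new_tasks

print(f"~{speedup:.1f}x faster, ~{task_reduction:.1f}x fewer tasks")
# ~19.2x faster, ~20.2x fewer tasks
```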

Validation

  • Baseline profiling:
    • git show d91af0f:examples/models/qwen3/qwen3_32b_decode_scope3.py > /tmp/scope3_base.py
    • python /tmp/scope3_base.py -p a2a3 -d 7 --enable-profiling
  • Current branch profiling:
    • python examples/models/qwen3/qwen3_32b_decode_scope3.py -p a2a3 -d 7 --enable-profiling
  • Correctness check:
    • current branch PASS (out: 131072/131072 matched)

@coderabbitai

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Updates the Qwen3 32B scope-3 decode kernel: HIDDEN is increased from 5120 to 8192, output-projection/residual accumulation is postponed until after the full per-tile projection, the post-attention RMSNorm is refactored with an intermediate BF16 cast, and the MLP down-projection is reworked to use an mlp_tile buffer with matmul-driven accumulation.

Changes

Cohort / File(s): Qwen3 32B Scope-3 Decode Kernel — examples/models/qwen3/qwen3_32b_decode_scope3.py
Summary: HIDDEN constant increased to 8192 (tiling/block counts adjusted); the output projection now accumulates a full tile before adding the residual (single resid1_tile assemble); the post-attention RMSNorm hoists the BF16 cast into an intermediate normed_bf16 before assemble; the down_proj_tile accumulator is replaced with an mlp_tile buffer, with down_acc computed via matmul and then added into out; stage numbering and residual writeback are simplified accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • hw-native-sys/pypto-lib issue 70: Touches the same scope-3 output-projection/MLP/down-projection and residual accumulation logic; relevant to the tiling and accumulation changes.

Possibly related PRs

  • hw-native-sys/pypto-lib PR 88: Makes opposing edits to the same scope-3 decode kernel (HIDDEN size, output-projection/residual, MLP/down-projection).
  • hw-native-sys/pypto-lib PR 89: Shares the BF16 cast hoisting refactor pattern used in the post-attention norm.
  • hw-native-sys/pypto-lib PR 65: Modifies the same scope-3 decode implementation with overlapping changes to tiling, residual handling, BF16 casts, and MLP down-projection.

Poem

🐰 In tiles and sums I hop and play,
Hidden widths now stretch their way,
Residuals wait till projections sing,
MLPs gather, matmuls bring—
A tiny rabbit cheers the new decay.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title accurately summarizes the main change: refactoring the Qwen3 Scope3 decode file with performance optimizations through improved projection accumulation paths and simplified control flow.
  • Description check ✅ — The description is directly related to the changeset, detailing the refactoring objectives, optimization approach, performance metrics, and validation methodology for the Qwen3 Scope3 file changes.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)

73-73: Consider using ASCII character for multiplication.

The comment uses a Unicode multiplication sign (×, U+00D7) which may cause issues with text search tools. Consider using ASCII x or * for better searchability.

-                # Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK.
+                # Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK.
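One way to hunt for such characters across a file is a small scanner like the following (illustrative only; this script is not part of the repo):

```python
import sys

def find_non_ascii(path):
    """Return (line, column, char) for every non-ASCII character in a file."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ord(ch) > 127:
                    hits.append((lineno, col, ch))
    return hits

if __name__ == "__main__" and len(sys.argv) > 1:
    for lineno, col, ch in find_non_ascii(sys.argv[1]):
        print(f"{sys.argv[1]}:{lineno}:{col}: U+{ord(ch):04X} {ch!r}")
```

Run against qwen3_32b_decode_scope3.py it would flag the U+00D7 occurrence the bot points at.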
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 73, Replace the
Unicode multiplication sign in the comment "Stage 1: Output projection: attn_out
× wo, tiled by Q_OUT_CHUNK." with an ASCII character (use either "x" or "*") so
text search and tooling won't be disrupted; update the comment to something like
"Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK." to locate it
by the same phrase.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 244499f1-0dc6-46c3-a04b-e9762df4b99e

📥 Commits

Reviewing files that changed from the base of the PR and between d91af0f and 0643f2a.

📒 Files selected for processing (1)
  • examples/models/qwen3/qwen3_32b_decode_scope3.py


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request increases the HIDDEN dimension to 8192 and refactors the decoding pipeline in qwen3_32b_decode_scope3.py, including changes to residual addition and the introduction of an mlp_tile for intermediate MLP results. However, the feedback highlights several critical memory constraints: the increased hidden size exceeds hardware buffer limits given the current BATCH_TILE setting, and the use of pl.incore() in multiple stages is likely to cause overflows as the tensors are too large for local memory. It is recommended to revert to pl.auto_incore() and adjust the tiling strategy to stay within hardware limits.


 BATCH = 16
-HIDDEN = 5120
+HIDDEN = 8192


Severity: high

Increasing HIDDEN to 8192 while maintaining BATCH_TILE = 16 (line 33) will likely cause a memory overflow in the on-chip vector buffer. For instance, resid1_tile (FP32) now requires 16 × 8192 × 4 bytes = 512 KB, which exceeds the 248 KB limit mentioned in the qwen3_tilelet.md documentation. This is consistent with the 507018 runtime error reported in the PR description. Consider reducing BATCH_TILE to 4 to stay within hardware constraints, as seen in the main qwen3_32b_decode.py file.
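The reviewer's arithmetic can be reproduced directly; the 248 KB buffer limit is taken from the review comment above, not independently verified against hardware documentation:

```python
# Reproduce the on-chip buffer-size estimate from the review.
# LIMIT_KB = 248 is quoted from the review (qwen3_tilelet.md), an assumption here.
LIMIT_KB = 248

def tile_kb(batch_tile, hidden, bytes_per_elem):
    """Size of a [batch_tile, hidden] tile in KB."""
    return batch_tile * hidden * bytes_per_elem / 1024

resid1_kb = tile_kb(16, 8192, 4)   # FP32 tile at BATCH_TILE=16
reduced_kb = tile_kb(4, 8192, 4)   # same tile at the suggested BATCH_TILE=4
print(resid1_kb, resid1_kb > LIMIT_KB)     # 512.0 True  -> overflows
print(reduced_kb, reduced_kb <= LIMIT_KB)  # 128.0 True  -> fits
```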

 post_norm_tile = pl.create_tensor([BATCH_TILE, HIDDEN_CFG], dtype=pl.BF16)
 down_proj_tile = pl.create_tensor([BATCH_TILE, HIDDEN_CFG], dtype=pl.FP32)
-with pl.auto_incore():
+with pl.incore():


Severity: high

Switching from pl.auto_incore() to pl.incore() here is risky. pl.incore() asserts that the entire block's data fits in local memory. Given that resid1_tile (512 KB) and post_norm_tile (256 KB) are both active in this scope and exceed the 248 KB hardware limit, this will likely cause a compilation or runtime failure. It is safer to use pl.auto_incore() to allow the compiler to automatically handle tiling and data movement between global and local memory.

Suggested change
-with pl.incore():
+with pl.auto_incore():

post_norm_tile = pl.assemble(post_norm_tile, normed_bf16, [0, k0])

# Stage 4: Initialize mlp_tile
mlp_tile = pl.create_tensor([BATCH_TILE, INTER_CFG], dtype=pl.BF16)


Severity: high

The mlp_tile tensor of size [BATCH_TILE, INTER_CFG] (approx. 800 KB for BATCH_TILE=16) is too large to be allocated as a local tensor using pl.create_tensor. This will lead to a memory overflow on the device. Instead of materializing the full intermediate MLP output in local memory, consider a tiling strategy that processes chunks of the intermediate dimension and accumulates them directly into the down-projection output, or ensure the tensor is placed in global memory if the DSL supports it.
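The ~800 KB estimate is consistent with an intermediate width of 25600; that value for INTER_CFG is an assumption here, since the constant is not shown in the excerpt:

```python
# Back out the intermediate width implied by the reviewer's ~800 KB figure.
# INTER_CFG = 25600 is an assumed value; the constant is not shown in this PR.
BATCH_TILE, INTER_CFG, BF16_BYTES = 16, 25600, 2
mlp_tile_kb = BATCH_TILE * INTER_CFG * BF16_BYTES / 1024
print(mlp_tile_kb)  # 800.0
```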

# Stage 8 & 9: Down projection + final residual writeback.
for dob in pl.range(HIDDEN_BLOCKS):
d0 = dob * K_CHUNK
with pl.incore():


Severity: high

Using pl.incore() inside the dob loop is problematic because it assumes the sliced tensors (mlp_tile, resid1_tile) are resident in local memory. Given their sizes, they cannot be fully resident. If the intention was to slice from global memory, pl.auto_incore() should be used to allow the compiler to manage the necessary DMA transfers and tiling.

Suggested change
-with pl.incore():
+with pl.auto_incore():

- rewrite scope3 into explicit stage-by-stage sections with clearer dataflow
- switch projection accumulations to matmul/matmul_acc and streamline residual assembly
- update default shape baseline (HIDDEN=8192) and keep existing chunk constants
- refresh tensor initializers and remove dead local from golden
- expose CLI platform/device/profiling options and backend selection in compile_and_run
- validate on a2a3sim: out matched 131072/131072
@high-cloud high-cloud force-pushed the codex/tidy-qwen3-scope3 branch from 0643f2a to f2afe11 on Apr 10, 2026 02:52

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)

73-73: Consider replacing Unicode multiplication sign with ASCII x.

Static analysis flagged the ambiguous × (MULTIPLICATION SIGN) character. While harmless, using ASCII x improves searchability and cross-editor compatibility.

📝 Suggested fix
-                # Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK.
+                # Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 73, Replace the
Unicode multiplication sign in the comment "Stage 1: Output projection: attn_out
× wo, tiled by Q_OUT_CHUNK." with an ASCII "x" to improve searchability and
editor compatibility; locate the comment near the output projection description
referencing symbols attn_out, wo and Q_OUT_CHUNK in qwen3_32b_decode_scope3.py
and update it to read "attn_out x wo" (or "attn_out x wo, tiled by
Q_OUT_CHUNK.") without changing surrounding code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b1bde245-1fab-4e89-ad65-a1934bc51af4

📥 Commits

Reviewing files that changed from the base of the PR and between 0643f2a and f2afe11.

📒 Files selected for processing (1)
  • examples/models/qwen3/qwen3_32b_decode_scope3.py

@zhangqi-chen zhangqi-chen merged commit 4f729f3 into hw-native-sys:main Apr 10, 2026
5 checks passed