
Qwen3 Scope3 refactor with performance improvements#90

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from high-cloud:codex/tidy-qwen3-scope3
Apr 10, 2026
Conversation

@high-cloud
Contributor

@high-cloud high-cloud commented Apr 10, 2026

Summary

  • refactor examples/models/qwen3/qwen3_32b_decode_scope3.py into clearer stage-by-stage structure
  • optimize projection accumulation paths by switching to matmul + matmul_acc and reducing redundant intermediate updates
  • keep chunk constant set intact (K_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK, BATCH_TILE)
  • clean up the runtime entry point (CLI options and backend selection) and remove minor dead code in the golden path
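The accumulation pattern described in the second bullet can be sketched in NumPy; `matmul_acc` here is a hypothetical stand-in for the DSL's accumulating matmul, since its exact semantics are not shown in this PR:

```python
import numpy as np

def matmul_acc(acc, a, b):
    # Hypothetical stand-in for an accumulating matmul primitive:
    # acc += a @ b, reusing the accumulator instead of allocating
    # a fresh intermediate per chunk.
    acc += a @ b
    return acc

rng = np.random.default_rng(0)
BATCH_TILE, K_CHUNK, HIDDEN = 4, 8, 16
num_chunks = 3

x_chunks = [rng.standard_normal((BATCH_TILE, K_CHUNK)) for _ in range(num_chunks)]
w_chunks = [rng.standard_normal((K_CHUNK, HIDDEN)) for _ in range(num_chunks)]

# First chunk: a plain matmul initializes the tile; remaining chunks accumulate.
acc = x_chunks[0] @ w_chunks[0]
for xc, wc in zip(x_chunks[1:], w_chunks[1:]):
    acc = matmul_acc(acc, xc, wc)

# Reference: concatenate all chunks and do one big matmul.
ref = np.concatenate(x_chunks, axis=1) @ np.concatenate(w_chunks, axis=0)
assert np.allclose(acc, ref)
```

The point of the refactor is that only the first chunk materializes a fresh tile; every later chunk folds into the existing accumulator, avoiding redundant intermediate updates.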

Performance (profiled with -p a2a3 --enable-profiling)

Comparison baseline: origin/main version of the same file (d91af0f).

  • Total Test Time: 63044.26 us -> 3284.44 us (about 19.2x faster)
  • Wall-clock: 63044.3 us -> 3284.4 us
  • Total tasks: 33411 -> 1651 (about 20.2x fewer tasks)
  • Per-task latency: 3.93 us -> 44.58 us (higher per task, but far fewer tasks overall)
  • End-to-end outcome: substantial runtime reduction from task graph simplification
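The headline ratios above can be checked directly from the raw numbers (the per-task latency figures come from the profiler and are not simply total time divided by task count):

```python
# Sanity-check the speedup and task-count ratios quoted above.
base_us, new_us = 63044.26, 3284.44
base_tasks, new_tasks = 33411, 1651

speedup = base_us / new_us          # total-time ratio
task_reduction = base_tasks / new_tasks

print(f"~{speedup:.1f}x faster, ~{task_reduction:.1f}x fewer tasks")
# ~19.2x faster, ~20.2x fewer tasks
```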

Validation

  • Baseline profiling:
    • git show d91af0f:examples/models/qwen3/qwen3_32b_decode_scope3.py > /tmp/scope3_base.py
    • python /tmp/scope3_base.py -p a2a3 -d 7 --enable-profiling
  • Current branch profiling:
    • python examples/models/qwen3/qwen3_32b_decode_scope3.py -p a2a3 -d 7 --enable-profiling
  • Correctness check:
    • current branch PASS (out: 131072/131072 matched)

@coderabbitai

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Updates the Qwen3 32B scope-3 decode kernel: HIDDEN is increased from 5120 to 8192, output-projection/residual accumulation is postponed until after the full per-tile projection, the post-attention RMSNorm is refactored with an intermediate BF16 cast, and the MLP down-projection is reworked to use an mlp_tile buffer with matmul-driven accumulation.

Changes

Cohort / File(s): Qwen3 32B Scope-3 Decode Kernel — examples/models/qwen3/qwen3_32b_decode_scope3.py
Summary: HIDDEN constant increased to 8192 (tiling/block counts adjusted); the output projection now accumulates a full tile before adding the residual (single resid1_tile assemble); the post-attention RMSNorm hoists the BF16 cast into an intermediate normed_bf16 before assemble; the down_proj_tile accumulator is replaced with an mlp_tile buffer, with down_acc computed via matmul and then added into out; stage numbering and residual writeback are simplified accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • hw-native-sys/pypto-lib issue 70: Touches the same scope-3 output-projection/MLP/down-projection and residual accumulation logic; relevant to the tiling and accumulation changes.

Possibly related PRs

  • hw-native-sys/pypto-lib PR 88: Makes opposing edits to the same scope-3 decode kernel (HIDDEN size, output-projection/residual, MLP/down-projection).
  • hw-native-sys/pypto-lib PR 89: Shares the BF16 cast hoisting refactor pattern used in the post-attention norm.
  • hw-native-sys/pypto-lib PR 65: Modifies the same scope-3 decode implementation with overlapping changes to tiling, residual handling, BF16 casts, and MLP down-projection.

Poem

🐰 In tiles and sums I hop and play,
Hidden widths now stretch their way,
Residuals wait till projections sing,
MLPs gather, matmuls bring—
A tiny rabbit cheers the new decay.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title accurately summarizes the main change: refactoring the Qwen3 Scope3 decode file with performance optimizations through improved projection accumulation paths and simplified control flow.
  • Description check ✅ — The description is directly related to the changeset, detailing the refactoring objectives, optimization approach, performance metrics, and validation methodology for the Qwen3 Scope3 file changes.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)

73-73: Consider using ASCII character for multiplication.

The comment uses a Unicode multiplication sign (×, U+00D7) which may cause issues with text search tools. Consider using ASCII x or * for better searchability.

-                # Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK.
+                # Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK.
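One way to hunt for such characters across a file is a small scanner like the following (illustrative only; this script is not part of the repo):

```python
import sys

def find_non_ascii(path):
    """Return (line, column, char) for every non-ASCII character in a file."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ord(ch) > 127:
                    hits.append((lineno, col, ch))
    return hits

if __name__ == "__main__" and len(sys.argv) > 1:
    for lineno, col, ch in find_non_ascii(sys.argv[1]):
        print(f"{sys.argv[1]}:{lineno}:{col}: U+{ord(ch):04X} {ch!r}")
```

Run against qwen3_32b_decode_scope3.py it would flag the U+00D7 occurrence the bot points at.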
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 73, Replace the
Unicode multiplication sign in the comment "Stage 1: Output projection: attn_out
× wo, tiled by Q_OUT_CHUNK." with an ASCII character (use either "x" or "*") so
text search and tooling won't be disrupted; update the comment to something like
"Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK." to locate it
by the same phrase.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 244499f1-0dc6-46c3-a04b-e9762df4b99e

📥 Commits

Reviewing files that changed from the base of the PR and between d91af0f and 0643f2a.

📒 Files selected for processing (1)
  • examples/models/qwen3/qwen3_32b_decode_scope3.py


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request increases the HIDDEN dimension to 8192 and refactors the decoding pipeline in qwen3_32b_decode_scope3.py, including changes to residual addition and the introduction of an mlp_tile for intermediate MLP results. However, the feedback highlights several critical memory constraints: the increased hidden size exceeds hardware buffer limits given the current BATCH_TILE setting, and the use of pl.incore() in multiple stages is likely to cause overflows as the tensors are too large for local memory. It is recommended to revert to pl.auto_incore() and adjust the tiling strategy to stay within hardware limits.


 BATCH = 16
-HIDDEN = 5120
+HIDDEN = 8192


Severity: high

Increasing HIDDEN to 8192 while maintaining BATCH_TILE = 16 (line 33) will likely cause a memory overflow in the on-chip vector buffer. For instance, resid1_tile (FP32) now requires 16 × 8192 × 4 bytes = 512 KB, which exceeds the 248 KB limit mentioned in the qwen3_tilelet.md documentation. This is consistent with the 507018 runtime error reported in the PR description. Consider reducing BATCH_TILE to 4 to stay within hardware constraints, as seen in the main qwen3_32b_decode.py file.
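The reviewer's arithmetic can be reproduced directly; the 248 KB buffer limit is taken from the review comment above, not independently verified against hardware documentation:

```python
# Reproduce the on-chip buffer-size estimate from the review.
# LIMIT_KB = 248 is quoted from the review (qwen3_tilelet.md), an assumption here.
LIMIT_KB = 248

def tile_kb(batch_tile, hidden, bytes_per_elem):
    """Size of a [batch_tile, hidden] tile in KB."""
    return batch_tile * hidden * bytes_per_elem / 1024

resid1_kb = tile_kb(16, 8192, 4)   # FP32 tile at BATCH_TILE=16
reduced_kb = tile_kb(4, 8192, 4)   # same tile at the suggested BATCH_TILE=4
print(resid1_kb, resid1_kb > LIMIT_KB)     # 512.0 True  -> overflows
print(reduced_kb, reduced_kb <= LIMIT_KB)  # 128.0 True  -> fits
```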

 post_norm_tile = pl.create_tensor([BATCH_TILE, HIDDEN_CFG], dtype=pl.BF16)
 down_proj_tile = pl.create_tensor([BATCH_TILE, HIDDEN_CFG], dtype=pl.FP32)
-with pl.auto_incore():
+with pl.incore():


Severity: high

Switching from pl.auto_incore() to pl.incore() here is risky. pl.incore() asserts that the entire block's data fits in local memory. Given that resid1_tile (512 KB) and post_norm_tile (256 KB) are both active in this scope and exceed the 248 KB hardware limit, this will likely cause a compilation or runtime failure. It is safer to use pl.auto_incore() to allow the compiler to automatically handle tiling and data movement between global and local memory.

Suggested change
-with pl.incore():
+with pl.auto_incore():

post_norm_tile = pl.assemble(post_norm_tile, normed_bf16, [0, k0])

# Stage 4: Initialize mlp_tile
mlp_tile = pl.create_tensor([BATCH_TILE, INTER_CFG], dtype=pl.BF16)


Severity: high

The mlp_tile tensor of size [BATCH_TILE, INTER_CFG] (approx. 800 KB for BATCH_TILE=16) is too large to be allocated as a local tensor using pl.create_tensor. This will lead to a memory overflow on the device. Instead of materializing the full intermediate MLP output in local memory, consider a tiling strategy that processes chunks of the intermediate dimension and accumulates them directly into the down-projection output, or ensure the tensor is placed in global memory if the DSL supports it.
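The ~800 KB estimate is consistent with an intermediate width of 25600; that value for INTER_CFG is an assumption here, since the constant is not shown in the excerpt:

```python
# Back out the intermediate width implied by the reviewer's ~800 KB figure.
# INTER_CFG = 25600 is an assumed value; the constant is not shown in this PR.
BATCH_TILE, INTER_CFG, BF16_BYTES = 16, 25600, 2
mlp_tile_kb = BATCH_TILE * INTER_CFG * BF16_BYTES / 1024
print(mlp_tile_kb)  # 800.0
```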

# Stage 8 & 9: Down projection + final residual writeback.
for dob in pl.range(HIDDEN_BLOCKS):
d0 = dob * K_CHUNK
with pl.incore():


Severity: high

Using pl.incore() inside the dob loop is problematic because it assumes the sliced tensors (mlp_tile, resid1_tile) are resident in local memory. Given their sizes, they cannot be fully resident. If the intention was to slice from global memory, pl.auto_incore() should be used to allow the compiler to manage the necessary DMA transfers and tiling.

Suggested change
-with pl.incore():
+with pl.auto_incore():

- rewrite scope3 into explicit stage-by-stage sections with clearer dataflow
- switch projection accumulations to matmul/matmul_acc and streamline residual assembly
- update default shape baseline (HIDDEN=8192) and keep existing chunk constants
- refresh tensor initializers and remove dead local from golden
- expose CLI platform/device/profiling options and backend selection in compile_and_run
- validate on a2a3sim: out matched 131072/131072
@high-cloud high-cloud force-pushed the codex/tidy-qwen3-scope3 branch from 0643f2a to f2afe11 on Apr 10, 2026 02:52

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)

73-73: Consider replacing Unicode multiplication sign with ASCII x.

Static analysis flagged the ambiguous × (MULTIPLICATION SIGN) character. While harmless, using ASCII x improves searchability and cross-editor compatibility.

📝 Suggested fix
-                # Stage 1: Output projection: attn_out × wo, tiled by Q_OUT_CHUNK.
+                # Stage 1: Output projection: attn_out x wo, tiled by Q_OUT_CHUNK.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 73, Replace the
Unicode multiplication sign in the comment "Stage 1: Output projection: attn_out
× wo, tiled by Q_OUT_CHUNK." with an ASCII "x" to improve searchability and
editor compatibility; locate the comment near the output projection description
referencing symbols attn_out, wo and Q_OUT_CHUNK in qwen3_32b_decode_scope3.py
and update it to read "attn_out x wo" (or "attn_out x wo, tiled by
Q_OUT_CHUNK.") without changing surrounding code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b1bde245-1fab-4e89-ad65-a1934bc51af4

📥 Commits

Reviewing files that changed from the base of the PR and between 0643f2a and f2afe11.

📒 Files selected for processing (1)
  • examples/models/qwen3/qwen3_32b_decode_scope3.py

@zhangqi-chen zhangqi-chen merged commit 4f729f3 into hw-native-sys:main Apr 10, 2026
5 checks passed