Update: chunk Qwen3 decode scope1 projections#104

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from ndleslx:optmain on Apr 13, 2026

@ndleslx ndleslx commented Apr 13, 2026

Summary

  • Apply the wider scope1 reduction chunk and chunked Q/KV projection loops to both examples/models/qwen3/qwen3_32b_decode_scope1.py and the scope1 section of examples/models/qwen3/qwen3_32b_decode.py.
  • Keep the full decode scope3 path on its original K_CHUNK = 128 by introducing a scope1-specific chunk constant in qwen3_32b_decode.py.
  • Scope1 benchmark on a2a3 device 1 with runtime profiling: origin/main wall time 525.04 us and 161 tasks; updated branch wall time 350.02 us and 37 tasks.
  • Full decode benchmark on a2a3 device 1 with --max-seq --runtime-profiling: before change wall time 3198.22 us and 1503 tasks; after change wall time 3080.66 us and 1379 tasks.
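
The benchmark figures above can be turned into speedups and task-count deltas with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary; the helper name `summarize` is made up for illustration and is not part of the repository.

```python
def summarize(before_us, after_us, before_tasks, after_tasks):
    """Return speedup, relative wall-time saving, and number of tasks removed."""
    return {
        "speedup": before_us / after_us,
        "time_saved_pct": 100.0 * (before_us - after_us) / before_us,
        "tasks_removed": before_tasks - after_tasks,
    }


# Numbers copied from the PR summary (a2a3 device 1, runtime profiling).
scope1 = summarize(525.04, 350.02, 161, 37)
full_decode = summarize(3198.22, 3080.66, 1503, 1379)

for name, stats in (("scope1", scope1), ("full decode", full_decode)):
    print(
        f"{name}: {stats['speedup']:.2f}x, "
        f"{stats['time_saved_pct']:.1f}% less wall time, "
        f"{stats['tasks_removed']} fewer tasks"
    )
# scope1: 1.50x, 33.3% less wall time, 124 fewer tasks
# full decode: 1.04x, 3.7% less wall time, 124 fewer tasks
```

Both runs drop the same 124 tasks, which is what one would expect if the change only touches scope1.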

coderabbitai bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Changed Scope‑1 tiling and loop structure in qwen3 decode: introduced a larger RMSNorm chunk size (512) for Scope‑1, and converted Q/K/V projection output-block loops to parallel/core-group scopes with chunked-loop optimization, relocating projection assembly into those scopes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Scope‑1 decode adjustments (examples/models/qwen3/qwen3_32b_decode_scope1.py) | Increased K_CHUNK from 128 to 512 for RMSNorm and projection slicing; replaced sequential pl.range(...) output-block loops with pl.at(..., optimization=pl.chunked_loop_optimizer) containing pl.parallel(..., chunk=4) and moved pl.assemble(...) for Q/K/V inside the new parallel/core-group scope. |
| Scope‑split and golden path updates (examples/models/qwen3/qwen3_32b_decode.py) | Added SCOPE1_K_CHUNK = 512 and retained K_CHUNK = 128 for other scopes; updated Scope‑1 RMSNorm, Q/K/V tiling, and loop bounds to use SCOPE1_K_CHUNK; adjusted golden/reference RMSNorm chunking and renamed some Scope‑3 variables for clarity. |
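
To make the effect of the wider reduction chunk concrete, here is a plain-Python sketch of a chunked RMSNorm. It is not the `pl` DSL kernel from the PR: the function `rmsnorm_chunked`, the hidden size of 5120, and the `eps` value are all assumptions chosen for illustration. It shows that moving from 128 to 512 changes only the number of reduction steps, not the result.

```python
import math


def rmsnorm_chunked(x, weight, k_chunk, eps=1e-6):
    """RMSNorm whose sum of squares is accumulated k_chunk elements at a time."""
    hidden = len(x)
    assert hidden % k_chunk == 0, "hidden size must be a multiple of k_chunk"
    sq_sum = 0.0
    num_chunks = 0
    for k0 in range(0, hidden, k_chunk):
        # Partial reduction over one chunk of the hidden dimension.
        sq_sum += sum(v * v for v in x[k0:k0 + k_chunk])
        num_chunks += 1
    inv_rms = 1.0 / math.sqrt(sq_sum / hidden + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)], num_chunks


HIDDEN = 5120  # assumed hidden size; the PR does not state it
x = [((i * 37) % 101 - 50) / 50.0 for i in range(HIDDEN)]  # deterministic test data
w = [1.0] * HIDDEN

out_128, steps_128 = rmsnorm_chunked(x, w, k_chunk=128)
out_512, steps_512 = rmsnorm_chunked(x, w, k_chunk=512)
max_diff = max(abs(a - b) for a, b in zip(out_128, out_512))

print(steps_128, steps_512)  # 40 10
```

Under these assumptions the wider chunk needs a quarter as many reduction steps, and `max_diff` stays at floating-point rounding level.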

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 With wider chunks the tiles now sweep,

512 dreams in normalized keep,
Core groups hum and parallel play,
Projections gather where they used to stray,
A rabbit cheers code hopping on display 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: updating the Qwen3 decode scope1 projections with chunking optimizations. |
| Description check | ✅ Passed | The pull request description clearly explains the changes (increasing K_CHUNK and applying chunked Q/KV projection loops) and includes benchmark results demonstrating performance improvements. |

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request increases the K_CHUNK size and optimizes the Q, K, and V projection stages by wrapping loops in a chunked_loop_optimizer and converting them to parallel loops. A review comment points out that the pl.parallel calls should include an explicit start index to ensure compatibility with the DSL and avoid potential runtime errors.

- for ob in pl.range(q_out_blocks):
-     q0 = ob * Q_OUT_CHUNK
+ with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
+     for ob in pl.parallel(q_out_blocks, chunk=4):

Priority: high

The pl.parallel function in this repository is consistently used with at least two positional arguments for the start and stop indices (e.g., pl.parallel(0, q_out_blocks, ...)), as seen in other model examples. Using a single argument may not be supported by the DSL and could lead to incorrect loop bounds or runtime errors.

Suggested change
- for ob in pl.parallel(q_out_blocks, chunk=4):
+ for ob in pl.parallel(0, q_out_blocks, chunk=4):
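
Whether `pl.parallel` accepts a single stop argument is a question about the DSL that this thread does not settle. As a rough illustration of the two call shapes and of what grouping iterations with `chunk=4` means, here is a plain-Python stand-in. The function `chunked_blocks` and the block count of 10 are hypothetical; it follows the semantics of Python's built-in `range`, not the actual behavior of `pl.parallel`.

```python
def chunked_blocks(*args, chunk=1):
    """Group a range-like iteration space into lists of at most `chunk` indices.

    Accepts (stop) or (start, stop), mirroring Python's built-in range().
    """
    if len(args) == 1:
        start, stop = 0, args[0]
    elif len(args) == 2:
        start, stop = args
    else:
        raise TypeError("expected (stop) or (start, stop)")
    indices = list(range(start, stop))
    return [indices[i:i + chunk] for i in range(0, len(indices), chunk)]


q_out_blocks = 10  # hypothetical block count, for illustration only

implicit = chunked_blocks(q_out_blocks, chunk=4)
explicit = chunked_blocks(0, q_out_blocks, chunk=4)

print(explicit)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In this stand-in both forms produce the same groups. The explicit `(0, stop)` form simply removes any dependence on how a single positional argument is interpreted.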

@ndleslx ndleslx requested a review from zhangqi-chen April 13, 2026 03:46
Apply the larger scope1 reduction chunk and chunked Q/KV projection
loops to both the standalone scope1 example and the full decode
example.

In the full decode path, keep scope3 chunking at 128 by introducing a
scope1-specific chunk constant, so only scope1 uses the wider 512-way
reduction and projection tiling.

Benchmarks on a2a3 device 1 show lower task counts and lower end-to-end
runtime for both the scope1-only path and the full decode path.
@ndleslx ndleslx changed the title from "Update: chunk Qwen3 scope1 decode projections" to "Update: chunk Qwen3 decode scope1 projections" on Apr 13, 2026
@zhangqi-chen zhangqi-chen merged commit efce0d4 into hw-native-sys:main Apr 13, 2026
5 checks passed

# Scope 1 tiling constants.
K_CHUNK = 128
SCOPE1_K_CHUNK = 512

This name is a bit odd.
