Update: chunk Qwen3 decode scope1 projections by ndleslx · Pull Request #104 · hw-native-sys/pypto-lib

ndleslx · 2026-04-13T03:42:24Z

Summary

Apply the wider scope1 reduction chunk and chunked Q/KV projection loops to both examples/models/qwen3/qwen3_32b_decode_scope1.py and the scope1 section of examples/models/qwen3/qwen3_32b_decode.py.
Keep the full decode scope3 path on its original K_CHUNK = 128 by introducing a scope1-specific chunk constant in qwen3_32b_decode.py.
Scope1 benchmark on a2a3 device 1 with runtime profiling: origin/main wall time 525.04 us and 161 tasks; updated branch wall time 350.02 us and 37 tasks.
Full decode benchmark on a2a3 device 1 with --max-seq --runtime-profiling: before change wall time 3198.22 us and 1503 tasks; after change wall time 3080.66 us and 1379 tasks.
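
The relative gains implied by the two benchmark lines above can be checked with a few lines of arithmetic. This is only a sketch over the numbers quoted in the summary; the helper name `summarize` is made up for illustration and is not part of the repository.

```python
def summarize(before_us, after_us, before_tasks, after_tasks):
    """Derive speedup and savings from before/after wall time (us) and task counts."""
    saved_us = before_us - after_us
    return {
        "speedup": before_us / after_us,
        "time_saved_us": saved_us,
        "time_saved_pct": 100.0 * saved_us / before_us,
        "tasks_removed": before_tasks - after_tasks,
    }

# Numbers copied from the PR summary (a2a3 device 1).
scope1 = summarize(525.04, 350.02, 161, 37)
full = summarize(3198.22, 3080.66, 1503, 1379)

for name, stats in (("scope1", scope1), ("full decode", full)):
    print(
        f"{name}: {stats['speedup']:.2f}x, "
        f"-{stats['time_saved_us']:.2f} us ({stats['time_saved_pct']:.1f}%), "
        f"-{stats['tasks_removed']} tasks"
    )
```

Both runs drop the same number of tasks (124), which is consistent with the task reduction coming entirely from scope1.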

Related Issues

[Perf] Split scope1 projection accumulation in Qwen3 decode example #81

coderabbitai · 2026-04-13T03:42:39Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Changed Scope‑1 tiling and loop structure in qwen3 decode: introduced a larger RMSNorm chunk size (512) for Scope‑1, and converted Q/K/V projection output-block loops to parallel/core-group scopes with chunked-loop optimization, relocating projection assembly into those scopes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Scope‑1 decode adjustments `examples/models/qwen3/qwen3_32b_decode_scope1.py` | Increased `K_CHUNK` from `128` to `512` for RMSNorm and projection slicing; replaced sequential `pl.range(...)` output-block loops with `pl.at(..., optimization=pl.chunked_loop_optimizer)` containing `pl.parallel(..., chunk=4)` and moved `pl.assemble(...)` for Q/K/V inside the new parallel/core-group scope. |
| Scope‑split and golden path updates `examples/models/qwen3/qwen3_32b_decode.py` | Added `SCOPE1_K_CHUNK = 512` and retained `K_CHUNK = 128` for other scopes; updated Scope‑1 RMSNorm, Q/K/V tiling, and loop bounds to use `SCOPE1_K_CHUNK`; adjusted golden/reference RMSNorm chunking and renamed some Scope‑3 variables for clarity. |
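
As a rough illustration of why widening the reduction chunk cuts the number of loop steps without changing the result, here is a plain-Python sketch of a chunked RMS accumulation. It is not the pypto DSL code from the PR. The hidden size of 5120 is an assumption made for the example, and `chunked_rms` is a hypothetical helper.

```python
import math

def chunked_rms(x, k_chunk):
    """Accumulate the sum of squares in k_chunk-wide slices, then take the RMS.

    Returns the RMS value and the number of chunk iterations performed.
    """
    assert len(x) % k_chunk == 0, "length must be a multiple of the chunk size"
    sum_sq = 0.0
    steps = 0
    for k0 in range(0, len(x), k_chunk):
        sum_sq += sum(v * v for v in x[k0:k0 + k_chunk])
        steps += 1
    return math.sqrt(sum_sq / len(x)), steps

HIDDEN = 5120  # assumed hidden size, chosen only for this illustration

# Deterministic test vector with values in [-1, 1].
x = [((i * 37) % 101 - 50) / 50.0 for i in range(HIDDEN)]

rms_128, steps_128 = chunked_rms(x, 128)  # original K_CHUNK
rms_512, steps_512 = chunked_rms(x, 512)  # SCOPE1_K_CHUNK

print(steps_128, steps_512)  # 40 10
print(math.isclose(rms_128, rms_512, rel_tol=1e-9))
```

Under these assumptions the wider chunk needs a quarter of the iterations, while the two results agree up to floating-point summation order.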

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

[Perf] Split scope1 projection accumulation in Qwen3 decode example #81 — Similar edits to Scope‑1 Q/K/V projection tiling and where partial projection results are assembled.

Possibly related PRs

Refactor: Qwen3 decode with 3-scope architecture and TILELET rename #99 — Also modifies Scope‑1 tiling and restructures Q/K/V projection loops with core-group/parallel and chunking changes.

Poem

🐰 With wider chunks the tiles now sweep,
512 dreams in normalized keep,
Core groups hum and parallel play,
Projections gather where they used to stray,
A rabbit cheers code hopping on display 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: updating the Qwen3 decode scope1 projections with chunking optimizations. |
| Description check | ✅ Passed | The pull request description clearly explains the changes (increasing K_CHUNK and applying chunked Q/KV projection loops) and includes benchmark results demonstrating performance improvements. |

gemini-code-assist

Code Review

This pull request increases the K_CHUNK size and optimizes the Q, K, and V projection stages by wrapping loops in a chunked_loop_optimizer and converting them to parallel loops. A review comment points out that the pl.parallel calls should include an explicit start index to ensure compatibility with the DSL and avoid potential runtime errors.

gemini-code-assist · 2026-04-13T03:45:40Z

examples/models/qwen3/qwen3_32b_decode_scope1.py

-                for ob in pl.range(q_out_blocks):
-                    q0 = ob * Q_OUT_CHUNK
+                with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer):
+                    for ob in pl.parallel(q_out_blocks, chunk=4):


The pl.parallel function in this repository is consistently used with at least two positional arguments for the start and stop indices (e.g., pl.parallel(0, q_out_blocks, ...)), as seen in other model examples. Using a single argument may not be supported by the DSL and could lead to incorrect loop bounds or runtime errors.

Suggested change

-                    for ob in pl.parallel(q_out_blocks, chunk=4):
+                    for ob in pl.parallel(0, q_out_blocks, chunk=4):
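
To make the explicit start index and the `chunk=4` grouping concrete, the following plain-Python stand-in shows one plausible reading: iterations over `[start, stop)` are grouped into consecutive chunks of four. `chunked_indices` is a hypothetical helper written for this note. It is not the pypto `pl.parallel` implementation, whose actual scheduling semantics may differ.

```python
def chunked_indices(start, stop, chunk):
    """Group loop indices in [start, stop) into consecutive chunks of size `chunk`.

    The last group is shorter when (stop - start) is not a multiple of `chunk`.
    """
    return [
        list(range(c0, min(c0 + chunk, stop)))
        for c0 in range(start, stop, chunk)
    ]

# Example: 10 output blocks, grouped four at a time, starting from index 0.
groups = chunked_indices(0, 10, 4)
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Writing the start index explicitly, as the review suggests, leaves no ambiguity about whether a lone positional argument is read as the start or the stop bound.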

Apply the larger scope1 reduction chunk and chunked Q/KV projection loops to both the standalone scope1 example and the full decode example. In the full decode path, keep scope3 chunking at 128 by introducing a scope1-specific chunk constant, so only scope1 uses the wider 512-element reduction chunk and projection tiling. Benchmarks on a2a3 device 1 show lower task counts and lower wall time for both the scope1-only path and the full decode path.

bumble0918 · 2026-04-13T07:17:40Z

examples/models/qwen3/qwen3_32b_decode.py


 # Scope 1 tiling constants.
-K_CHUNK = 128
+SCOPE1_K_CHUNK = 512


This name is a bit odd.

ndleslx added this to pto project Apr 13, 2026

ndleslx removed this from pto project Apr 13, 2026

gemini-code-assist bot reviewed Apr 13, 2026

ndleslx requested a review from zhangqi-chen April 13, 2026 03:46

ndleslx force-pushed the optmain branch from f58e541 to 72982a5 Compare April 13, 2026 06:35

ndleslx changed the title ~~Update: chunk Qwen3 scope1 decode projections~~ Update: chunk Qwen3 decode scope1 projections Apr 13, 2026

zhangqi-chen merged commit efce0d4 into hw-native-sys:main Apr 13, 2026
5 checks passed

bumble0918 reviewed Apr 13, 2026
