
Optimize Qwen3 scope1 decode performance#102

Open
ndleslx wants to merge 1 commit into hw-native-sys:main from ndleslx:2main

Conversation

Contributor

@ndleslx ndleslx commented Apr 12, 2026

Summary

  • parallelize the RMS partial reduction and Q/K/V output chunk loops in qwen3_32b_decode_scope1.py
  • increase K_CHUNK from 128 to 512 and compute normalized chunks on demand instead of materializing the full normalized tile
  • reduce total test time on Ascend a2a3 device 1 (measured with --runtime-profiling) from 530.62 us on origin/main to 412.72 us on 2main, a 22.2% speedup
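The chunked RMS reduction described in the first two bullets can be sketched in plain NumPy. This is an illustrative stand-in only: the real kernel in qwen3_32b_decode_scope1.py uses device tile primitives, and the function name, shapes, and eps value here are assumptions, not the file's actual API.

```python
import numpy as np

K_CHUNK = 512  # chunk width along the hidden dimension, per this PR


def rms_norm_chunked(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMS-normalize rows of x by reducing per-chunk squared sums first."""
    batch, hidden = x.shape
    assert hidden % K_CHUNK == 0, "hidden must be a multiple of K_CHUNK"
    n_chunks = hidden // K_CHUNK

    # Stage 1: one squared-sum partial per chunk (each chunk is independent,
    # so this loop is the part that can run in parallel on the device)
    sq_partials = np.empty((batch, n_chunks))
    for c in range(n_chunks):
        chunk = x[:, c * K_CHUNK : (c + 1) * K_CHUNK]
        sq_partials[:, c] = np.sum(chunk * chunk, axis=1)

    # Single reduction over the partials yields the per-row scale factor;
    # normalized chunks can then be produced on demand from x and inv_rms
    # instead of materializing one full normalized tile.
    inv_rms = 1.0 / np.sqrt(sq_partials.sum(axis=1) / hidden + eps)
    return x * inv_rms[:, None]


x = np.random.default_rng(0).standard_normal((4, 2 * K_CHUNK))
ref = x / np.sqrt((x * x).mean(axis=1, keepdims=True) + 1e-6)
assert np.allclose(rms_norm_chunked(x), ref)
```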

Related Issues


coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Increases K_CHUNK from 128 to 512 and refactors the Scope 1 decode pipeline: Stage 1 computes per-chunk RMS squared-row partials and reduces them once; Stages 2 and 3 compute normalized chunks on-the-fly and use matmul/matmul_acc with parallelized output-block loops.

Changes

  • Qwen3 Scope1 decode refactor (examples/models/qwen3/qwen3_32b_decode_scope1.py): Bumped K_CHUNK 128→512. Stage 1: compute sq_partials via a chunked parallel loop and a single reduction to partial_sq. Stage 2 (Q): compute normed_chunk on-the-fly, initialize q_acc with matmul, update with matmul_acc, parallelize ob. Stage 3 (K/V): analogous on-the-fly normed_chunk, update k_acc/v_acc in-place with matmul_acc, parallelize kv_out_blocks.
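The matmul-then-matmul_acc accumulation pattern the walkthrough describes (initialize the accumulator with the first chunk's product, then accumulate the remaining chunks in place) can be sketched as follows. The `matmul`/`matmul_acc` functions here are NumPy stand-ins for the device primitives, and `project` is a hypothetical helper, not code from the file:

```python
import numpy as np

K_CHUNK = 512


def matmul(a, b):
    """Stand-in for the device matmul primitive: returns a @ b."""
    return a @ b


def matmul_acc(acc, a, b):
    """Stand-in for the accumulating matmul primitive: acc += a @ b."""
    acc += a @ b


def project(normed_chunk, w, hidden):
    """Accumulate a Q/K/V projection across K_CHUNK-wide weight slices.

    normed_chunk(c) returns the c-th normalized input chunk on demand,
    mirroring the on-the-fly normalization in Stages 2 and 3.
    """
    n_chunks = hidden // K_CHUNK
    # Initialize the accumulator with the first chunk's product ...
    acc = matmul(normed_chunk(0), w[:K_CHUNK])
    # ... then update it in place for the remaining chunks.
    for c in range(1, n_chunks):
        matmul_acc(acc, normed_chunk(c), w[c * K_CHUNK : (c + 1) * K_CHUNK])
    return acc


rng = np.random.default_rng(0)
hidden = 2 * K_CHUNK
x = rng.standard_normal((4, hidden))
w = rng.standard_normal((hidden, 64))
out = project(lambda c: x[:, c * K_CHUNK : (c + 1) * K_CHUNK], w, hidden)
assert np.allclose(out, x @ w)
```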

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Chunks leap up from 128 to 512,
I tally squares in tidy partial rows,
Normed chunks appear where needed, swift—
Matmuls hum and accumulators grow,
Hopping onward, decoding as I go.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title accurately summarizes the main change: optimizing Qwen3 scope1 decode performance through scheduling improvements, parallelization, and chunk size optimization.
  • Description check (✅ Passed): The pull request description accurately describes the changeset, explaining the parallelization optimizations, K_CHUNK increase, and performance improvements achieved.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the qwen3_32b_decode_scope1.py example by increasing the K_CHUNK size and refactoring the RMSNorm and projection stages. The changes introduce parallelization and chunked loop optimization, fusing normalization steps directly into the Q, K, and V projection loops to improve memory efficiency and performance by eliminating large intermediate tensors. I have no feedback to provide.

- parallelize RMS partial reduction and Q/K/V output chunk loops
- increase K_CHUNK to 512 and normalize chunks on demand to reduce wall time

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/models/qwen3/qwen3_32b_decode_scope1.py (1)

32-49: ⚠️ Potential issue | 🟠 Major

Validate hidden_size against the larger K_CHUNK.

Line 47 now truncates to hidden // 512, while Lines 99-107 and 130-140 still assume a full 512-wide chunk exists. That means any non-default hidden_size that's not a multiple of 512 will silently drop the tail in the compiled path, while golden_qwen3_scope1 still processes it via Lines 234-236. Please fail fast here or add tail handling before this lands.

Proposed guard
 def build_qwen3_scope1_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     num_kv_heads: int = NUM_KV_HEADS,
     head_dim: int = HEAD_DIM,
 ):
     hidden = hidden_size
     kv_hidden = num_kv_heads * head_dim
+    if hidden % K_CHUNK != 0:
+        raise ValueError(
+            f"hidden_size ({hidden}) must be a multiple of K_CHUNK ({K_CHUNK})"
+        )
     hidden_blocks = hidden // K_CHUNK
     q_out_blocks = hidden // Q_OUT_CHUNK
     kv_out_blocks = kv_hidden // KV_OUT_CHUNK
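The failure mode the guard protects against can be checked with plain arithmetic (the hidden_size value below is hypothetical, chosen only to show the truncation):

```python
K_CHUNK = 512
hidden_size = 5000  # hypothetical non-default value, not a multiple of 512

hidden_blocks = hidden_size // K_CHUNK   # floor division truncates to 9 blocks
covered = hidden_blocks * K_CHUNK        # 4608 columns actually processed
dropped = hidden_size - covered          # 392 tail columns silently ignored

assert (hidden_blocks, covered, dropped) == (9, 4608, 392)
```

With the proposed guard, this configuration raises a ValueError instead of silently diverging from golden_qwen3_scope1 on the tail.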
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope1.py` around lines 32 - 49, The
build_qwen3_scope1_program currently computes hidden_blocks = hidden // K_CHUNK
which silently drops any hidden_size tail if hidden_size is not a multiple of
K_CHUNK; update build_qwen3_scope1_program to either (a) validate and fail fast
by checking hidden_size % K_CHUNK == 0 and raise a clear error (e.g.,
ValueError) referencing K_CHUNK, hidden_size and hidden_blocks, or (b) implement
explicit tail handling so the compiled path matches golden_qwen3_scope1 by
processing the final partial block (adjust q_out_blocks/kv_out_blocks/MLP
handling accordingly). Make the change within build_qwen3_scope1_program and
ensure all dependent computed names (hidden_blocks, q_out_blocks, kv_out_blocks,
KV_OUT_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK) are updated to reflect the validation
or tail case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e0e54de-506a-431e-b932-046c5a3f13f9

📥 Commits

Reviewing files that changed from the base of the PR and between 5fce038 and 44a8863.

📒 Files selected for processing (1)
  • examples/models/qwen3/qwen3_32b_decode_scope1.py

@ndleslx ndleslx changed the title Optimize Qwen3 scope1 decode scheduling Optimize Qwen3 scope1 decode performance Apr 12, 2026