
Optimize Qwen3 scope1 decode performance#102

Open
ndleslx wants to merge 1 commit into hw-native-sys:main from ndleslx:2main

Conversation

Contributor

@ndleslx ndleslx commented Apr 12, 2026

Summary

  • parallelize the RMS partial reduction and Q/K/V output chunk loops in qwen3_32b_decode_scope1.py
  • increase K_CHUNK from 128 to 512 and compute normalized chunks on demand instead of materializing the full normalized tile
  • reduce total test time on Ascend a2a3 device 1 (measured with --runtime-profiling) from 530.62 us on origin/main to 412.72 us on 2main, a 22.2% speedup
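The chunked RMS reduction described in the first two bullets can be sketched in plain NumPy. This is an illustrative stand-in only: the real kernel in qwen3_32b_decode_scope1.py uses device tile primitives, and the function name, shapes, and eps value here are assumptions, not the file's actual API.

```python
import numpy as np

K_CHUNK = 512  # chunk width along the hidden dimension, per this PR


def rms_norm_chunked(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMS-normalize rows of x by reducing per-chunk squared sums first."""
    batch, hidden = x.shape
    assert hidden % K_CHUNK == 0, "hidden must be a multiple of K_CHUNK"
    n_chunks = hidden // K_CHUNK

    # Stage 1: one squared-sum partial per chunk (each chunk is independent,
    # so this loop is the part that can run in parallel on the device)
    sq_partials = np.empty((batch, n_chunks))
    for c in range(n_chunks):
        chunk = x[:, c * K_CHUNK : (c + 1) * K_CHUNK]
        sq_partials[:, c] = np.sum(chunk * chunk, axis=1)

    # Single reduction over the partials yields the per-row scale factor;
    # normalized chunks can then be produced on demand from x and inv_rms
    # instead of materializing one full normalized tile.
    inv_rms = 1.0 / np.sqrt(sq_partials.sum(axis=1) / hidden + eps)
    return x * inv_rms[:, None]


x = np.random.default_rng(0).standard_normal((4, 2 * K_CHUNK))
ref = x / np.sqrt((x * x).mean(axis=1, keepdims=True) + 1e-6)
assert np.allclose(rms_norm_chunked(x), ref)
```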

Related Issues


coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Increases K_CHUNK from 128 to 512 and refactors the Scope 1 decode pipeline: Stage 1 computes per-chunk RMS squared-row partials and reduces them once; Stages 2 and 3 compute normalized chunks on-the-fly and use matmul/matmul_acc with parallelized output-block loops.

Changes

  • Qwen3 Scope1 decode refactor (examples/models/qwen3/qwen3_32b_decode_scope1.py): Bumped K_CHUNK 128→512. Stage 1: compute sq_partials via a chunked parallel loop and a single reduction to partial_sq. Stage 2 (Q): compute normed_chunk on-the-fly, initialize q_acc with matmul, update with matmul_acc, parallelize ob. Stage 3 (K/V): analogous on-the-fly normed_chunk, update k_acc/v_acc in-place with matmul_acc, parallelize kv_out_blocks.
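The matmul-then-matmul_acc accumulation pattern the walkthrough describes (initialize the accumulator with the first chunk's product, then accumulate the remaining chunks in place) can be sketched as follows. The `matmul`/`matmul_acc` functions here are NumPy stand-ins for the device primitives, and `project` is a hypothetical helper, not code from the file:

```python
import numpy as np

K_CHUNK = 512


def matmul(a, b):
    """Stand-in for the device matmul primitive: returns a @ b."""
    return a @ b


def matmul_acc(acc, a, b):
    """Stand-in for the accumulating matmul primitive: acc += a @ b."""
    acc += a @ b


def project(normed_chunk, w, hidden):
    """Accumulate a Q/K/V projection across K_CHUNK-wide weight slices.

    normed_chunk(c) returns the c-th normalized input chunk on demand,
    mirroring the on-the-fly normalization in Stages 2 and 3.
    """
    n_chunks = hidden // K_CHUNK
    # Initialize the accumulator with the first chunk's product ...
    acc = matmul(normed_chunk(0), w[:K_CHUNK])
    # ... then update it in place for the remaining chunks.
    for c in range(1, n_chunks):
        matmul_acc(acc, normed_chunk(c), w[c * K_CHUNK : (c + 1) * K_CHUNK])
    return acc


rng = np.random.default_rng(0)
hidden = 2 * K_CHUNK
x = rng.standard_normal((4, hidden))
w = rng.standard_normal((hidden, 64))
out = project(lambda c: x[:, c * K_CHUNK : (c + 1) * K_CHUNK], w, hidden)
assert np.allclose(out, x @ w)
```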

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Chunks leap up from 128 to 512,
I tally squares in tidy partial rows,
Normed chunks appear where needed, swift—
Matmuls hum and accumulators grow,
Hopping onward, decoding as I go.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title accurately summarizes the main change: optimizing Qwen3 scope1 decode performance through scheduling improvements, parallelization, and chunk size optimization.
  • Description check (✅ Passed): The pull request description accurately describes the changeset, explaining the parallelization optimizations, K_CHUNK increase, and performance improvements achieved.





@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the qwen3_32b_decode_scope1.py example by increasing the K_CHUNK size and refactoring the RMSNorm and projection stages. The changes introduce parallelization and chunked loop optimization, fusing normalization steps directly into the Q, K, and V projection loops to improve memory efficiency and performance by eliminating large intermediate tensors. I have no feedback to provide.

- parallelize RMS partial reduction and Q/K/V output chunk loops
- increase K_CHUNK to 512 and normalize chunks on demand to reduce wall time

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/models/qwen3/qwen3_32b_decode_scope1.py (1)

32-49: ⚠️ Potential issue | 🟠 Major

Validate hidden_size against the larger K_CHUNK.

Line 47 now truncates to hidden // 512, while Lines 99-107 and 130-140 still assume a full 512-wide chunk exists. That means any non-default hidden_size that's not a multiple of 512 will silently drop the tail in the compiled path, while golden_qwen3_scope1 still processes it via Lines 234-236. Please fail fast here or add tail handling before this lands.

Proposed guard
 def build_qwen3_scope1_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     num_kv_heads: int = NUM_KV_HEADS,
     head_dim: int = HEAD_DIM,
 ):
     hidden = hidden_size
     kv_hidden = num_kv_heads * head_dim
+    if hidden % K_CHUNK != 0:
+        raise ValueError(
+            f"hidden_size ({hidden}) must be a multiple of K_CHUNK ({K_CHUNK})"
+        )
     hidden_blocks = hidden // K_CHUNK
     q_out_blocks = hidden // Q_OUT_CHUNK
     kv_out_blocks = kv_hidden // KV_OUT_CHUNK
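The failure mode the guard protects against can be checked with plain arithmetic (the hidden_size value below is hypothetical, chosen only to show the truncation):

```python
K_CHUNK = 512
hidden_size = 5000  # hypothetical non-default value, not a multiple of 512

hidden_blocks = hidden_size // K_CHUNK   # floor division truncates to 9 blocks
covered = hidden_blocks * K_CHUNK        # 4608 columns actually processed
dropped = hidden_size - covered          # 392 tail columns silently ignored

assert (hidden_blocks, covered, dropped) == (9, 4608, 392)
```

With the proposed guard, this configuration raises a ValueError instead of silently diverging from golden_qwen3_scope1 on the tail.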
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope1.py` around lines 32 - 49, The
build_qwen3_scope1_program currently computes hidden_blocks = hidden // K_CHUNK
which silently drops any hidden_size tail if hidden_size is not a multiple of
K_CHUNK; update build_qwen3_scope1_program to either (a) validate and fail fast
by checking hidden_size % K_CHUNK == 0 and raise a clear error (e.g.,
ValueError) referencing K_CHUNK, hidden_size and hidden_blocks, or (b) implement
explicit tail handling so the compiled path matches golden_qwen3_scope1 by
processing the final partial block (adjust q_out_blocks/kv_out_blocks/MLP
handling accordingly). Make the change within build_qwen3_scope1_program and
ensure all dependent computed names (hidden_blocks, q_out_blocks, kv_out_blocks,
KV_OUT_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK) are updated to reflect the validation
or tail case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e0e54de-506a-431e-b932-046c5a3f13f9

📥 Commits

Reviewing files that changed from the base of the PR and between 5fce038 and 44a8863.

📒 Files selected for processing (1)
  • examples/models/qwen3/qwen3_32b_decode_scope1.py

@ndleslx ndleslx changed the title Optimize Qwen3 scope1 decode scheduling Optimize Qwen3 scope1 decode performance Apr 12, 2026