Objective
Improve the end-to-end performance and code quality of `examples/models/qwen3/qwen3_32b_decode_tilelet.py`, targeting reduced GM round-trips, better on-chip utilisation, and cleaner code patterns across all three `pl.auto_incore()` / `pl.incore()` scopes.
Scope-by-Scope Analysis
Scope 1 — Input RMSNorm + Q/K/V Projection (lines ~152–233)
Performance
- Zero-init overhead: sequential loops (lines 153–166) initialise `q_proj`, `attn_out`, `k_proj`, `v_proj` one chunk at a time. Consider using `pl.parallel` or eliminating the zero-init by handling the first matmul iteration differently.
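One way to drop the zero-init entirely, shown here as a minimal NumPy sketch rather than the `pl` DSL (names and shapes are illustrative): let the first chunk's matmul create the accumulator, and only accumulate on later iterations.

```python
import numpy as np

def proj_accumulate(x_chunks, w_chunks):
    """Accumulate sum_i(x_i @ w_i) without a separate zero-init pass.

    The first iteration *creates* the accumulator (plain matmul); the
    remaining iterations accumulate into it (matmul_acc-style), so no
    zero-fill of the output tile is ever issued.
    """
    acc = x_chunks[0] @ w_chunks[0]      # first chunk: plain matmul
    for x, w in zip(x_chunks[1:], w_chunks[1:]):
        acc += x @ w                     # later chunks: accumulate in place
    return acc
```

The same shape works with `pl.matmul` on the first iteration and `pl.matmul_acc` on the rest, if the loop can be peeled.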
- K/V projection separation: K and V share the same `normed_tile` input but are computed in separate `pl.incore()` blocks (lines 212–233). Fusing them into a single incore region would halve the `normed_tile` GM reads.
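The fused shape, as a NumPy sketch under the assumption that one loaded tile can feed both projections (names are illustrative):

```python
import numpy as np

def fused_kv_proj(normed_tile, w_k, w_v):
    """Compute the K and V projections from a single read of the input.

    normed_tile is loaded once and feeds both matmuls; in the unfused
    version each pl.incore() block re-reads it from GM.
    """
    k = normed_tile @ w_k    # K projection
    v = normed_tile @ w_v    # V projection, same input tile
    return k, v
```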
- Manual incore: Scope 1 uses explicit `pl.incore()` instead of `pl.auto_incore()`, missing potential compiler optimisations for buffer placement and scheduling.
Code quality
- RMSNorm missing `rsqrt`: lines 184–196 compute `variance = mean(x²) + eps` but then multiply `x_chunk` by `variance` directly instead of `rsqrt(variance)`. Compare with Scope 3 (lines 422–423), which correctly uses `pl.rsqrt`. The golden reference has the same bug, so results match, but the math is wrong for actual RMSNorm.
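For reference, the correct RMSNorm formula the fix should converge on, as a NumPy sketch (the `pl` version would use `pl.rsqrt` on the variance):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: x * rsqrt(mean(x^2) + eps) * weight.

    The buggy version multiplies x by (mean(x^2) + eps) itself,
    skipping the reciprocal square root entirely.
    """
    variance = np.mean(x * x, axis=-1, keepdims=True) + eps
    return x * (1.0 / np.sqrt(variance)) * weight
```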
- Docstring stale: the module docstring (line 27) still references `BATCH_TILE=4`, but the constant was changed to 16 (line 89).
Scope 2 — RoPE + KV Cache Update + Decode Attention (lines ~240–390)
Performance
- Excessive GM round-trips in attention loop: each `ctx_blocks` iteration has 4 separate `pl.incore()` stages (QK matmul → softmax → SV matmul → online rescale, lines 328–381). Intermediate tensors `raw_scores_pad`, `exp_padded`, and `oi_tmp_pad` are written to GM and read back between stages. Fusing stages would eliminate these round-trips.
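The four stages implement the standard online-softmax (flash-attention-style) update, which can in principle run as one fused pass per context block with only a running max, normaliser, and output kept on chip. A NumPy sketch of that single-pass structure (names and shapes are illustrative, not the file's):

```python
import numpy as np

def online_attention(q, k_blocks, v_blocks):
    """One fused pass per context block: QK, softmax stats, SV, rescale.

    Keeps a running row-max m, running denominator l, and running output
    o, so no intermediate scores tensor round-trips through GM.
    """
    m = np.full(q.shape[0], -np.inf)               # running row max
    l = np.zeros(q.shape[0])                       # running softmax denom
    o = np.zeros((q.shape[0], v_blocks[0].shape[1]))
    for k, v in zip(k_blocks, v_blocks):
        s = q @ k.T                                # QK matmul for this block
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])             # block softmax numerator
        scale = np.exp(m - m_new)                  # rescale previous state
        l = l * scale + p.sum(axis=1)
        o = o * scale[:, None] + p @ v             # SV matmul, accumulated
        m = m_new
    return o / l[:, None]
```

Whether the stages can actually fuse depends on what a single incore region can hold, but this is the dependency structure the rescale stage already implies.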
- Padded matmul waste: `q_padded` is `[Q_HEAD_PAD=16, HEAD_DIM=128]` but only `Q_HEAD_BATCH=8` rows are valid. The QK and SV matmuls compute 2× the necessary rows, then discard half.
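If the hardware accepts an 8-row tile (an assumption worth checking against the matmul unit's minimum shape), slicing to the valid rows before the matmul halves the QK FLOPs. A NumPy sketch:

```python
import numpy as np

Q_HEAD_PAD, Q_HEAD_BATCH, HEAD_DIM = 16, 8, 128

def qk_valid_rows(q_padded, k):
    """Score only the valid query rows.

    Equivalent to the padded matmul restricted to its top half; the
    bottom Q_HEAD_PAD - Q_HEAD_BATCH rows are never computed.
    """
    return q_padded[:Q_HEAD_BATCH] @ k.T
```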
- Consider `matmul_acc`: the QK matmul (line 335) and SV matmul (line 364) could benefit from the `matmul_acc` pattern if the loop structure is refactored.
Code quality
- `raw_scores_pad` write-then-read: `raw_scores_pad` is created as a GM tensor (line 327), written via matmul (line 335), then sliced for softmax (line 343). This explicit GM tensor could be avoided if the stages were fused.
Scope 3 — Output Projection + Post-RMSNorm + MLP + Residual (lines ~393–469)
Performance
- Output projection pattern: line 409 uses `pl.add(o_acc, pl.matmul(a_chunk, w_chunk))` instead of `pl.matmul_acc`, losing potential hardware accumulation.
- Gate/Up shared reads: the MLP gate and up projections (lines 443–449) both read `post_chunk` from the same slice of `post_norm_tile` but accumulate separately. Consider whether they can share the load.
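A NumPy sketch of the shared-load shape, under the assumption (standard for Qwen-family models) that the MLP is SwiGLU, i.e. `SiLU(gate) * up`:

```python
import numpy as np

def mlp_gate_up(post_chunk, w_gate, w_up):
    """Gate and up projections from one load of the input chunk.

    Both matmuls consume the same post_chunk, so loading it once and
    feeding both halves the reads of post_norm_tile. The final line is
    the SwiGLU combination: SiLU(gate) * up.
    """
    gate = post_chunk @ w_gate
    up = post_chunk @ w_up
    return (gate / (1.0 + np.exp(-gate))) * up   # SiLU(gate) * up
```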
- Down projection inner parallelism: the `dob` loop (line 455) uses `chunk=4`; profiling may show a different chunk size suits the memory access pattern better.
Code quality
- Already uses `pl.auto_incore(split=pl.SplitMode.UP_DOWN)` ✓; this is the cleanest scope.
Cross-Scope Opportunities
- Scope 1 → Scope 2 GM traffic: `q_proj` / `k_proj` / `v_proj` are written to GM in Scope 1 and read back in Scope 2. Pipelining or merging these scopes could reduce total GM bandwidth.
- Consistent matmul pattern: standardise on `matmul` + `matmul_acc` (as Scope 1 already does for Q/K/V) across all scopes.
File
`examples/models/qwen3/qwen3_32b_decode_tilelet.py`