[Perf] qwen3_32b_decode_tilelet — Performance & Code Improvement #70

@zhangqi-chen

Description

Objective

Improve the end-to-end performance and code quality of examples/models/qwen3/qwen3_32b_decode_tilelet.py, targeting reduced GM round-trips, better on-chip utilisation, and cleaner code patterns across all three pl.auto_incore() / pl.incore() scopes.

Scope-by-Scope Analysis

Scope 1 — Input RMSNorm + Q/K/V Projection (lines ~152–233)

Performance

  • Zero-init overhead: sequential loops (lines 153–166) initialise q_proj, attn_out, k_proj, v_proj one chunk at a time. Consider using pl.parallel or eliminating the zero-init by handling the first matmul iteration differently.
  • K/V projection separation: K and V share the same normed_tile input but are computed in separate pl.incore() blocks (lines 212–233). Fusing them into a single incore region would halve the normed_tile GM reads.
  • Manual incore: Scope 1 uses explicit pl.incore() instead of pl.auto_incore(), missing potential compiler optimisations for buffer placement and scheduling.
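The zero-init elimination suggested above can be sketched in plain NumPy (numerics only; the actual kernel would use `pl.matmul` for the first chunk and `pl.matmul_acc`/`pl.add` for the rest — the function name here is illustrative, not from the file):

```python
import numpy as np

def proj_accumulate(x_chunks, w_chunks):
    """Accumulate a projection over K-dim chunks without a zero-init pass.

    Sketch of the pattern: the first iteration writes the matmul result
    directly (replacing the separate zero-fill loop over q_proj/k_proj/
    v_proj/attn_out), and subsequent iterations accumulate into it.
    """
    acc = None
    for x_c, w_c in zip(x_chunks, w_chunks):
        partial = x_c @ w_c
        # First iteration: a plain write stands in for the zero-init.
        acc = partial if acc is None else acc + partial
    return acc
```

The same result as zero-init-then-accumulate, with one fewer pass over each output buffer.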

Code quality

  • RMSNorm missing rsqrt: lines 184–196 compute variance = mean(x²) + eps but then multiply x_chunk by variance directly instead of by rsqrt(variance). Compare with Scope 3 (lines 422–423), which correctly uses pl.rsqrt. The golden reference has the same bug, so results match, but the math is wrong for actual RMSNorm.
  • Docstring stale: module docstring (line 27) still references BATCH_TILE=4 but the constant was changed to 16 (line 89).
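For reference, a minimal NumPy version of what the RMSNorm stage should compute, next to the buggy variant currently in Scope 1 (function names are illustrative; `gamma` stands in for the norm weight):

```python
import numpy as np

def rmsnorm_ref(x, gamma, eps=1e-6):
    # Correct RMSNorm: scale by the reciprocal square root of
    # mean(x^2) + eps, as Scope 3 does via pl.rsqrt.
    var = np.mean(x * x, axis=-1, keepdims=True) + eps
    return x * (1.0 / np.sqrt(var)) * gamma

def rmsnorm_buggy(x, gamma, eps=1e-6):
    # What Scope 1 (and the golden reference) currently compute:
    # multiplies by the variance itself instead of its rsqrt.
    var = np.mean(x * x, axis=-1, keepdims=True) + eps
    return x * var * gamma
```

With `gamma = 1`, `rmsnorm_ref` output rows have RMS ≈ 1; the buggy variant does not normalise at all.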

Scope 2 — RoPE + KV Cache Update + Decode Attention (lines ~240–390)

Performance

  • Excessive GM round-trips in attention loop: each ctx_blocks iteration has 4 separate pl.incore() stages (QK matmul → softmax → SV matmul → online rescale, lines 328–381). Intermediate tensors raw_scores_pad, exp_padded, oi_tmp_pad are written to GM and read back between stages. Fusing stages would eliminate these round-trips.
  • Padded matmul waste: q_padded is [Q_HEAD_PAD=16, HEAD_DIM=128] but only Q_HEAD_BATCH=8 rows are valid. The QK and SV matmuls compute 2× the necessary rows then discard half.
  • Consider matmul_acc: the QK matmul (line 335) and SV matmul (line 364) could benefit from the matmul_acc pattern if the loop structure is refactored.
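The fusion of the four stages amounts to standard online-softmax attention. A numerics-only NumPy sketch of one query's decode attention over KV-cache blocks, without materialising raw_scores_pad or exp_padded between stages (shapes and the function name are assumptions for illustration):

```python
import numpy as np

def decode_attn_online(q, k_blocks, v_blocks, scale):
    """Single-query online-softmax attention over context blocks.

    Each loop body fuses what the current code splits into four incore
    stages: QK matmul, softmax, SV matmul, and the online rescale.
    """
    m = -np.inf            # running max of scores
    l = 0.0                # running sum of exp(score - m)
    o = np.zeros_like(q)   # running (unnormalised) output
    for k_b, v_b in zip(k_blocks, v_blocks):
        s = (q @ k_b.T) * scale        # QK matmul for this block
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)      # rescale factor for prior state
        p = np.exp(s - m_new)          # block-local softmax numerators
        l = l * alpha + p.sum()
        o = o * alpha + p @ v_b        # SV matmul fused with rescale
        m = m_new
    return o / l
```

The result is bit-for-bit independent of the block count up to floating-point rounding, so it matches the unfused softmax over the full context.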

Code quality

  • raw_scores_pad write-then-read: raw_scores_pad is created as a GM tensor (line 327), written via matmul (line 335), then sliced for softmax (line 343). This explicit GM tensor could be avoided if the stages were fused.

Scope 3 — Output Projection + Post-RMSNorm + MLP + Residual (lines ~393–469)

Performance

  • Output projection pattern: line 409 uses pl.add(o_acc, pl.matmul(a_chunk, w_chunk)) instead of pl.matmul_acc, losing potential hardware accumulation.
  • Gate/Up shared reads: MLP gate and up projections (lines 443–449) both read post_chunk from the same slice of post_norm_tile but accumulate separately. Consider whether these can share the load.
  • Down projection inner parallelism: dob loop (line 455) uses chunk=4; profiling may show a different chunk size is better for the memory access pattern.
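The gate/up sharing suggested above can be sketched as follows (NumPy numerics only; `silu` is the standard SwiGLU activation and the chunked-weight layout is an assumption, not taken from the file):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def mlp_gate_up_shared(post_chunks, wg_chunks, wu_chunks):
    """Gate and up projections fed by a single read of each post_chunk.

    Each loaded chunk of the post-norm activation drives both matmuls
    before moving on, instead of being re-read for each projection.
    """
    gate = up = None
    for p_c, wg_c, wu_c in zip(post_chunks, wg_chunks, wu_chunks):
        g_part = p_c @ wg_c   # gate projection, first use of p_c
        u_part = p_c @ wu_c   # up projection, second use, same load
        gate = g_part if gate is None else gate + g_part
        up = u_part if up is None else up + u_part
    return silu(gate) * up
```

Whether this helps in practice depends on whether the two projections currently live in separate incore regions; if they already share a region, the load may be deduplicated by the compiler.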

Code quality

  • Already uses pl.auto_incore(split=pl.SplitMode.UP_DOWN) ✓ — cleanest scope.

Cross-Scope Opportunities

  • Scope 1 → Scope 2 GM traffic: q_proj / k_proj / v_proj are written to GM in Scope 1 and read back in Scope 2. Pipelining or merging these scopes could reduce total GM bandwidth.
  • Consistent matmul pattern: standardise on matmul + matmul_acc (as Scope 1 already does for Q/K/V) across all scopes.

File

examples/models/qwen3/qwen3_32b_decode_tilelet.py

Status

In Progress