[Perf] qwen3_32b_decode_tilelet — Performance & Code Improvement #70

@zhangqi-chen

Description

Objective

Improve the end-to-end performance and code quality of examples/models/qwen3/qwen3_32b_decode_tilelet.py, targeting reduced GM round-trips, better on-chip utilisation, and cleaner code patterns across all three pl.auto_incore() / pl.incore() scopes.

Scope-by-Scope Analysis

Scope 1 — Input RMSNorm + Q/K/V Projection (lines ~152–233)

Performance

  • Zero-init overhead: sequential loops (lines 153–166) initialise q_proj, attn_out, k_proj, v_proj one chunk at a time. Consider using pl.parallel or eliminating the zero-init by handling the first matmul iteration differently.
  • K/V projection separation: K and V share the same normed_tile input but are computed in separate pl.incore() blocks (lines 212–233). Fusing them into a single incore region would halve the normed_tile GM reads.
  • Manual incore: Scope 1 uses explicit pl.incore() instead of pl.auto_incore(), missing potential compiler optimisations for buffer placement and scheduling.
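The zero-init elimination suggested above can be sketched in plain NumPy (numerics only; the actual kernel would use `pl.matmul` for the first chunk and `pl.matmul_acc`/`pl.add` for the rest — the function name here is illustrative, not from the file):

```python
import numpy as np

def proj_accumulate(x_chunks, w_chunks):
    """Accumulate a projection over K-dim chunks without a zero-init pass.

    Sketch of the pattern: the first iteration writes the matmul result
    directly (replacing the separate zero-fill loop over q_proj/k_proj/
    v_proj/attn_out), and subsequent iterations accumulate into it.
    """
    acc = None
    for x_c, w_c in zip(x_chunks, w_chunks):
        partial = x_c @ w_c
        # First iteration: a plain write stands in for the zero-init.
        acc = partial if acc is None else acc + partial
    return acc
```

The same result as zero-init-then-accumulate, with one fewer pass over each output buffer.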

Code quality

  • RMSNorm missing rsqrt: lines 184–196 compute variance = mean(x²) + eps but then multiply x_chunk by variance directly instead of by rsqrt(variance). Compare with Scope 3 (lines 422–423), which correctly uses pl.rsqrt. The golden reference has the same bug, so results match, but the math is wrong for actual RMSNorm.
  • Docstring stale: module docstring (line 27) still references BATCH_TILE=4 but the constant was changed to 16 (line 89).
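For reference, a minimal NumPy version of what the RMSNorm stage should compute, next to the buggy variant currently in Scope 1 (function names are illustrative; `gamma` stands in for the norm weight):

```python
import numpy as np

def rmsnorm_ref(x, gamma, eps=1e-6):
    # Correct RMSNorm: scale by the reciprocal square root of
    # mean(x^2) + eps, as Scope 3 does via pl.rsqrt.
    var = np.mean(x * x, axis=-1, keepdims=True) + eps
    return x * (1.0 / np.sqrt(var)) * gamma

def rmsnorm_buggy(x, gamma, eps=1e-6):
    # What Scope 1 (and the golden reference) currently compute:
    # multiplies by the variance itself instead of its rsqrt.
    var = np.mean(x * x, axis=-1, keepdims=True) + eps
    return x * var * gamma
```

With `gamma = 1`, `rmsnorm_ref` output rows have RMS ≈ 1; the buggy variant does not normalise at all.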

Scope 2 — RoPE + KV Cache Update + Decode Attention (lines ~240–390)

Performance

  • Excessive GM round-trips in attention loop: each ctx_blocks iteration has 4 separate pl.incore() stages (QK matmul → softmax → SV matmul → online rescale, lines 328–381). Intermediate tensors raw_scores_pad, exp_padded, oi_tmp_pad are written to GM and read back between stages. Fusing stages would eliminate these round-trips.
  • Padded matmul waste: q_padded is [Q_HEAD_PAD=16, HEAD_DIM=128] but only Q_HEAD_BATCH=8 rows are valid. The QK and SV matmuls compute 2× the necessary rows then discard half.
  • Consider matmul_acc: the QK matmul (line 335) and SV matmul (line 364) could benefit from the matmul_acc pattern if the loop structure is refactored.
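The fusion of the four stages amounts to standard online-softmax attention. A numerics-only NumPy sketch of one query's decode attention over KV-cache blocks, without materialising raw_scores_pad or exp_padded between stages (shapes and the function name are assumptions for illustration):

```python
import numpy as np

def decode_attn_online(q, k_blocks, v_blocks, scale):
    """Single-query online-softmax attention over context blocks.

    Each loop body fuses what the current code splits into four incore
    stages: QK matmul, softmax, SV matmul, and the online rescale.
    """
    m = -np.inf            # running max of scores
    l = 0.0                # running sum of exp(score - m)
    o = np.zeros_like(q)   # running (unnormalised) output
    for k_b, v_b in zip(k_blocks, v_blocks):
        s = (q @ k_b.T) * scale        # QK matmul for this block
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)      # rescale factor for prior state
        p = np.exp(s - m_new)          # block-local softmax numerators
        l = l * alpha + p.sum()
        o = o * alpha + p @ v_b        # SV matmul fused with rescale
        m = m_new
    return o / l
```

The result is bit-for-bit independent of the block count up to floating-point rounding, so it matches the unfused softmax over the full context.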

Code quality

  • raw_scores_pad write-then-read: raw_scores_pad is created as a GM tensor (line 327), written via matmul (line 335), then sliced for softmax (line 343). This explicit GM tensor could be avoided if the stages were fused.

Scope 3 — Output Projection + Post-RMSNorm + MLP + Residual (lines ~393–469)

Performance

  • Output projection pattern: line 409 uses pl.add(o_acc, pl.matmul(a_chunk, w_chunk)) instead of pl.matmul_acc, losing potential hardware accumulation.
  • Gate/Up shared reads: MLP gate and up projections (lines 443–449) both read post_chunk from the same slice of post_norm_tile but accumulate separately. Consider whether these can share the load.
  • Down projection inner parallelism: dob loop (line 455) uses chunk=4; profiling may show a different chunk size is better for the memory access pattern.
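The gate/up sharing suggested above can be sketched as follows (NumPy numerics only; `silu` is the standard SwiGLU activation and the chunked-weight layout is an assumption, not taken from the file):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def mlp_gate_up_shared(post_chunks, wg_chunks, wu_chunks):
    """Gate and up projections fed by a single read of each post_chunk.

    Each loaded chunk of the post-norm activation drives both matmuls
    before moving on, instead of being re-read for each projection.
    """
    gate = up = None
    for p_c, wg_c, wu_c in zip(post_chunks, wg_chunks, wu_chunks):
        g_part = p_c @ wg_c   # gate projection, first use of p_c
        u_part = p_c @ wu_c   # up projection, second use, same load
        gate = g_part if gate is None else gate + g_part
        up = u_part if up is None else up + u_part
    return silu(gate) * up
```

Whether this helps in practice depends on whether the two projections currently live in separate incore regions; if they already share a region, the load may be deduplicated by the compiler.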

Code quality

  • Already uses pl.auto_incore(split=pl.SplitMode.UP_DOWN) ✓ — cleanest scope.

Cross-Scope Opportunities

  • Scope 1 → Scope 2 GM traffic: q_proj / k_proj / v_proj are written to GM in Scope 1 and read back in Scope 2. Pipelining or merging these scopes could reduce total GM bandwidth.
  • Consistent matmul pattern: standardise on matmul + matmul_acc (as Scope 1 already does for Q/K/V) across all scopes.

File

examples/models/qwen3/qwen3_32b_decode_tilelet.py

Status

In Progress