[Perf] Split scope1 projection accumulation in Qwen3 decode example #81

@ndleslx

Summary

Update the Qwen3 scope1 decode projection path to split Q/K/V accumulation into per-hidden-block matmuls in CUBE followed by a separate reduction in VEC.

Motivation / Use Case

The current scope1 implementation still performs Q, K, and V projection accumulation inside a single incore region (CUBE core) with repeated pl.matmul_acc calls.

We can replace that pattern with:

  1. per-hidden-block pl.matmul(...) results written into preallocated partial buffers, and
  2. a second incore pass that reduces those partials with pl.add(...).

This keeps the scope1 implementation more explicit, avoids a long single-incore accumulation chain, and makes scope1 more consistent with the recent Qwen3 decode refactoring direction already happening in the repository.

Proposed API / Behavior

No public API change is needed.

In examples/models/qwen3/qwen3_32b_decode_scope1.py, update build_decode_projection_program() so that:

  • q_partial, k_partial, and v_partial are preallocated before the batch loop
  • each hidden block computes its own pl.matmul(...) result
  • partial results are assembled into the corresponding temporary buffer
  • accumulation is done in a separate incore block using pl.full(..., value=0.0) plus repeated pl.add(...)
  • the final q_proj, k_proj, and v_proj outputs keep the same shapes and function signature as today
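The numerical equivalence behind this change can be checked with a small reference model. The sketch below is plain Python and does not use the pl.* API; the helper names (matmul, add, full, project_fused, project_split) and the k_chunk parameter are hypothetical stand-ins chosen for illustration. It assumes the projection is a sum over hidden blocks of block-wise products, which is what both the fused and the split pattern compute.

```python
# Reference model only: plain Python lists, not the pl.* API from the issue.
# All function names and k_chunk are hypothetical, chosen for illustration.

def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def full(rows, cols, value=0.0):
    """Matrix of the given shape filled with a constant."""
    return [[value] * cols for _ in range(rows)]

def slice_blocks(x, w, k0, k_chunk):
    """Take the hidden-dimension slice [k0, k0 + k_chunk) of x and w."""
    x_blk = [row[k0:k0 + k_chunk] for row in x]
    w_blk = w[k0:k0 + k_chunk]
    return x_blk, w_blk

def project_fused(x, w, k_chunk):
    """Current pattern: one accumulation chain over all hidden blocks."""
    acc = full(len(x), len(w[0]))
    for k0 in range(0, len(w), k_chunk):
        x_blk, w_blk = slice_blocks(x, w, k0, k_chunk)
        acc = add(acc, matmul(x_blk, w_blk))  # stands in for matmul_acc
    return acc

def project_split(x, w, k_chunk):
    """Proposed pattern: per-block products first, reduction afterwards."""
    num_blocks = (len(w) + k_chunk - 1) // k_chunk
    partial = [None] * num_blocks  # preallocated partial buffer
    # Pass 1: each hidden block writes its own product into the buffer.
    for kb in range(num_blocks):
        x_blk, w_blk = slice_blocks(x, w, kb * k_chunk, k_chunk)
        partial[kb] = matmul(x_blk, w_blk)
    # Pass 2: start from zeros and reduce the partials with repeated adds.
    out = full(len(x), len(w[0]), value=0.0)
    for kb in range(num_blocks):
        out = add(out, partial[kb])
    return out

if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.0], [2.0, 2.0, 1.0]]
    print(project_split(x, w, k_chunk=2) == project_fused(x, w, k_chunk=2))
```

Because both variants add the same block products in the same order, the outputs match; the split version trades extra buffer space for a shorter dependency chain inside each pass. Whether that trade pays off on the target hardware is what the issue proposes to evaluate.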

Alternatives Considered

  • Keep the current single-incore matmul_acc implementation
  • Exchange the N and K dimensions in the loop order
  • Tune the tiling size of K

Projects

Status

In Progress
