Summary
Update the Qwen3 scope1 decode projection path to split Q/K/V accumulation into per-hidden-block matmuls in CUBE followed by a separate reduction in VEC.
Motivation / Use Case
The current scope1 implementation still performs Q, K, and V projection accumulation inside a single incore region (CUBE core) with repeated pl.matmul_acc calls.
We can replace that pattern with:
- per-hidden-block pl.matmul(...) results written into preallocated partial buffers, and
- a second incore pass that reduces those partials with pl.add(...).
This keeps the scope1 implementation more explicit, avoids a long single-incore accumulation chain, and makes scope1 more consistent with the recent Qwen3 decode refactoring direction already happening in the repository.
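The two patterns are numerically equivalent, which is worth making explicit. The NumPy sketch below illustrates this with hypothetical shapes (the block count and dimensions are made up for the example, not taken from the Qwen3 config): the left side mimics the current matmul-accumulate chain, the right side mimics per-block matmuls into partial buffers followed by a separate reduction.

```python
import numpy as np

# Hypothetical shapes for illustration only: hidden dim split into 4 blocks.
rng = np.random.default_rng(0)
num_blocks, m, k_block, n = 4, 1, 64, 128
x_blocks = [rng.standard_normal((m, k_block)) for _ in range(num_blocks)]
w_blocks = [rng.standard_normal((k_block, n)) for _ in range(num_blocks)]

# Current pattern: one long accumulation chain (pl.matmul_acc-style),
# all inside a single incore region.
acc = np.zeros((m, n))
for xb, wb in zip(x_blocks, w_blocks):
    acc += xb @ wb

# Proposed pattern: per-block matmuls into partial buffers (CUBE pass),
# then a separate reduction pass over the partials (VEC pass).
partials = [xb @ wb for xb, wb in zip(x_blocks, w_blocks)]  # per-block pl.matmul
reduced = np.zeros((m, n))                                  # pl.full(..., value=0.0)
for p in partials:
    reduced = reduced + p                                   # repeated pl.add

assert np.allclose(acc, reduced)
```

Floating-point rounding can differ slightly between the two orderings on real hardware, but both compute the same sum of per-block products.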
Proposed API / Behavior
No public API change is needed.
In examples/models/qwen3/qwen3_32b_decode_scope1.py, update build_decode_projection_program() so that:
- q_partial, k_partial, and v_partial are preallocated before the batch loop
- each hidden block computes its own pl.matmul(...) result
- partial results are assembled into the corresponding temporary buffer
- accumulation is done in a separate incore block using pl.full(..., value=0.0) plus repeated pl.add(...)
- the final q_proj, k_proj, and v_proj outputs keep the same shapes and function signature as today
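Assuming pl.matmul, pl.full, and pl.add keep the signatures already used in this file, the restructured Q projection (K and V are analogous) could look roughly like the pseudocode below; alloc_partial, num_hidden_blocks, x_block, wq_block, and q_shape are hypothetical names for illustration, not actual identifiers from the file:

```python
# Pseudocode sketch, not runnable as-is: buffer allocation and incore
# region syntax are placeholders for whatever the file already uses.

# Preallocated before the batch loop (hypothetical helper).
q_partial = alloc_partial(num_hidden_blocks)

# CUBE pass: one independent matmul per hidden block, no accumulation chain.
for b in range(num_hidden_blocks):
    q_partial[b] = pl.matmul(x_block[b], wq_block[b])

# VEC pass: separate incore block that reduces the partials.
q_proj = pl.full(q_shape, value=0.0)
for b in range(num_hidden_blocks):
    q_proj = pl.add(q_proj, q_partial[b])

# k_proj and v_proj follow the same two-pass pattern.
```

The design point is that each pl.matmul is independent, so the CUBE work has no serial dependency chain; only the cheap pl.add reduction is sequential, and it moves to VEC.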
Alternatives Considered
- Keep the current single-incore matmul_acc implementation
- Consider exchanging the N and K dimensions in the loop
- Tune the tiling size of K