A5 target hangs on RMSNorm vector kernel that passes on A2A3 #441
Open
Description
Problem
A PTO-IR vector kernel (decode_projection_incore_0.pto) compiles and runs correctly when targeting A2A3, but hangs at runtime when compiled for A5.
Environment
- PTOAS version: 0.22
Reproduction
PTO files attached below.
decode_projection_incore_0.pto.txt
rmsnorm_incore_0.pto.txt
Steps:
- Change pto.target_arch from "a2a3" to "a5" in the module attributes
- Compile with ptoas
- Run on the A5 platform: the program hangs indefinitely (no crash, no error)
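When triaging a hang like this, it can help to run the binary under a watchdog so that "hangs indefinitely" is recorded as a distinct outcome rather than a stuck terminal. The following is a minimal sketch only: the helper name run_with_timeout is hypothetical, and because the issue does not give the actual command used to launch the compiled kernel on A5, a sleeping Python child process stands in for it.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hang'.

    cmd is a placeholder for whatever launches the compiled kernel;
    the real launch command is not given in this issue.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in child that sleeps longer than the timeout, imitating the A5 hang.
    stand_in = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(stand_in, timeout_s=0.5))
```

Running the same wrapper against the A2A3 and A5 builds with an identical timeout would give a reproducible pass/hang record for the table below.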
Behavior
| Target | Compile | Run |
|---|---|---|
| A2A3 | OK | OK |
| A5 | OK | Hangs |
Kernel Summary
This is an RMSNorm vector kernel (decode_projection_incore_0) from the Qwen3-32B decode layer projection. It operates on [16, 5120] BF16 input with K_CHUNK=128 (5120 / 128 = 40 iterations per loop):
- Loop 1 (accumulate squared partial sums): tload → tcvt (bf16→f32) → tmul (x²) → trowsum → tadd (accumulate) → tmov
- Post-loop (compute inv_rms): tmuls (÷5120) → tadds (+ε) → trsqrt
- Loop 2 (apply normalization): tload → tcvt → trowexpandmul (×inv_rms) → tcolexpandmul (×γ) → tcvt (f32→bf16) → tstore
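For reference, the three phases above can be sketched in plain Python as a host-side model of the expected numerics. This is an illustration under stated assumptions, not the PTO kernel itself: the function names are hypothetical, the ε value is not given in the issue (1e-6 is a placeholder), accumulation uses Python floats rather than true f32, and to_bf16 emulates only the final f32→bf16 rounding.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even bias
    bits = (bits + rounding) & 0xFFFF0000    # keep only the upper 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rmsnorm_chunked(x, gamma, k_chunk=128, eps=1e-6):
    """Chunked RMSNorm over rows of x; eps is an assumed value, not taken from the issue."""
    rows, hidden = len(x), len(x[0])
    assert hidden % k_chunk == 0, "hidden size must be a multiple of k_chunk"

    # Loop 1: accumulate per-row sums of squares, one K chunk at a time.
    sq_sum = [0.0] * rows
    for k0 in range(0, hidden, k_chunk):
        for r in range(rows):
            chunk = x[r][k0:k0 + k_chunk]            # tload + tcvt
            sq_sum[r] += sum(v * v for v in chunk)   # tmul, trowsum, tadd

    # Post-loop: inv_rms = 1 / sqrt(mean(x^2) + eps)  (tmuls, tadds, trsqrt)
    inv_rms = [1.0 / math.sqrt(s / hidden + eps) for s in sq_sum]

    # Loop 2: scale each row by inv_rms and each column by gamma, then round to bf16.
    out = [[0.0] * hidden for _ in range(rows)]
    for k0 in range(0, hidden, k_chunk):
        for r in range(rows):
            for c in range(k0, k0 + k_chunk):
                out[r][c] = to_bf16(x[r][c] * inv_rms[r] * gamma[c])
    return out


if __name__ == "__main__":
    # Tiny example: one row, hidden size 4, two chunks of 2.
    print(rmsnorm_chunked([[1.0, 2.0, 3.0, 4.0]], [1.0] * 4, k_chunk=2, eps=0.0))
```

A model like this is mainly useful for confirming that the A2A3 output is numerically correct; it says nothing about why the A5 build hangs.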
Operations Used
tload, tstore, tcvt, tmul, trowsum, tadd, tmov, tmuls, tadds, trsqrt, trowexpandmul, tcolexpandmul, texpands
PTO File
Context
Discovered during E2E validation of pypto-lib Qwen3-32B decode tilelet (hw-native-sys/pypto-lib#58, Scope 1).