A5 target hangs on RMSNorm vector kernel that passes on A2A3 #441

@zhangqi-chen

Problem

A PTO-IR vector kernel (decode_projection_incore_0.pto) compiles and runs correctly when targeting A2A3, but hangs at runtime when compiled for A5.

Environment

  • PTOAS version: 0.22

Reproduction

PTO files attached below:

decode_projection_incore_0.pto.txt
rmsnorm_incore_0.pto.txt

Steps:

  1. Change pto.target_arch from "a2a3" to "a5" in the module attributes
  2. Compile with ptoas
  3. Run on A5 platform — program hangs indefinitely (no crash, no error)
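Because the failure is a silent hang rather than a crash, it helps to run the A5 binary under a watchdog so the hang is reported as a distinct outcome. The sketch below is a hypothetical helper, not part of PTOAS: `retarget` assumes the `pto.target_arch` attribute and its quoted value sit on the same line (the exact attribute syntax is not shown in this report), and the command passed to `run_with_watchdog` is whatever launcher you already use.

```python
import re
import subprocess
import sys


def retarget(pto_text, old="a2a3", new="a5"):
    """Swap the target string on lines that mention pto.target_arch.

    Assumption: the attribute name is followed on the same line by the
    quoted target value, e.g. pto.target_arch = "a2a3".
    """
    pattern = re.compile(
        r'(pto\.target_arch[^"\n]*")' + re.escape(old) + r'(")'
    )
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), pto_text)


def run_with_watchdog(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hang'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"
```

With a generous timeout, a result of `"hang"` on A5 next to `"ok"` on A2A3 reproduces the behavior reported here without leaving a stuck process behind.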

Behavior

Target  Compile  Run
A2A3    OK       OK
A5      OK       Hangs

Kernel Summary

This is an RMSNorm vector kernel (decode_projection_incore_0) from the Qwen3-32B decode layer projection. It operates on a [16, 5120] BF16 input with K_CHUNK=128 (5120 / 128 = 40 iterations):

  1. Loop 1 (accumulate squared partial sums): tload → tcvt(bf16→f32) → tmul(x²) → trowsum → tadd (accumulate) → tmov
  2. Post-loop (compute inv_rms): tmuls(÷5120) → tadds(+ε) → trsqrt
  3. Loop 2 (apply normalization): tload → tcvt → trowexpandmul(×inv_rms) → tcolexpandmul(×γ) → tcvt(f32→bf16) → tstore
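For checking outputs once the kernel runs, the three stages above can be sketched as a host-side reference. This is a minimal pure-Python model, not the kernel itself: the function name and parameters are illustrative, all arithmetic uses Python floats, the bf16/f32 conversions (tcvt) are not modeled, and the value of ε is left as a parameter because the report does not state it.

```python
import math


def rmsnorm_chunked(x, gamma, k_chunk, eps):
    """Two-pass RMSNorm over column chunks, mirroring the loop structure.

    x is a list of rows, gamma a per-column weight list. Returns the
    normalized rows and the number of chunk iterations per loop.
    """
    rows, hidden = len(x), len(x[0])
    assert hidden % k_chunk == 0, "hidden size must be a multiple of k_chunk"
    n_iters = hidden // k_chunk

    # Loop 1: accumulate per-row sums of squares, one chunk at a time.
    sq_sum = [0.0] * rows
    for it in range(n_iters):
        lo = it * k_chunk
        for r in range(rows):
            chunk = x[r][lo:lo + k_chunk]
            sq_sum[r] += sum(v * v for v in chunk)

    # Post-loop: mean of squares, add epsilon, reciprocal square root.
    inv_rms = [1.0 / math.sqrt(s / hidden + eps) for s in sq_sum]

    # Loop 2: scale each row by inv_rms, then each column by gamma.
    out = [[0.0] * hidden for _ in range(rows)]
    for it in range(n_iters):
        lo = it * k_chunk
        for r in range(rows):
            for c in range(lo, lo + k_chunk):
                out[r][c] = x[r][c] * inv_rms[r] * gamma[c]
    return out, n_iters
```

Called with 16 rows of 5120 values and `k_chunk=128`, the function runs each loop for 40 iterations, matching the iteration count given above.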

Operations Used

tload, tstore, tcvt, tmul, trowsum, tadd, tmov, tmuls, tadds, trsqrt, trowexpandmul, tcolexpandmul, texpands

Context

Discovered during E2E validation of pypto-lib Qwen3-32B decode tilelet (hw-native-sys/pypto-lib#58, Scope 1).
