feat: Add async comm ops and emitc lowering by FangRui0 · Pull Request #349 · hw-native-sys/PTOAS

FangRui0 · 2026-03-24T13:06:21Z

No description provided.

FangRui0 · 2026-03-31T07:45:57Z

/run a3

reedhecre · 2026-03-31T07:52:13Z

Codex Review

This comment is updated automatically by the review bot.

PR: #349 feat: Add async comm ops and emitc lowering
Author: FangRui0
Base/Head: main / add_op
Head SHA: 73a30babb3b3
Trigger: new commits pushed to the PR
Generated At: 2026-03-31T09:52:50Z
Previous Head SHA: a520648f7ff3
Status: completed

Summary

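Found 2 P1 issues and 1 P2 issue: async transfers lack an element-type constraint and are silently mis-lowered to float; the session's scratch lifetime is not kept alive by plan-memory; and the side-effect modeling of test_async_event lets CSE fold repeated polls.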

Findings

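P1 Async transfers accept unsupported element types, which silently become float during lowering lib/PTO/IR/PTO.cpp:8529

The verifier here only checks that src and dst have the same element type and matching shapes; it does not restrict the element type to those the GlobalTensor lowering supports. The later EmitC lowering calls getElemTypeStringForGT(), which falls back to "float" for any unsupported type. As a result, inputs such as memref<128xi1> or memref<128xi24> pass verification but end up generating GlobalTensor<float,...>, so both the transferred byte count and the element semantics are wrong. This is a real miscompilation.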
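A minimal sketch of the kind of check this finding asks for, written without MLIR so it stands alone. `ElemType`, `elemTypeStringOrNone`, `verifyAsyncTransferElemType`, and the set of supported types are all hypothetical names and assumptions for illustration; they are not the PR's actual code or the real signature of getElemTypeStringForGT(). The point is only that the type mapping reports "no mapping" instead of defaulting to "float", and the verifier rejects such types up front.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for an MLIR element type.
struct ElemType {
  enum Kind { Float, Int } kind;
  unsigned bitWidth;
};

// Maps an element type to a C++ type name for GlobalTensor<...>.
// Returns std::nullopt instead of falling back to "float".
// The supported set below is an assumption for illustration only.
std::optional<std::string> elemTypeStringOrNone(const ElemType &t) {
  if (t.kind == ElemType::Float) {
    if (t.bitWidth == 32) return std::string("float");
    if (t.bitWidth == 16) return std::string("half");
    return std::nullopt;
  }
  switch (t.bitWidth) {
  case 8:  return std::string("int8_t");
  case 16: return std::string("int16_t");
  case 32: return std::string("int32_t");
  default: return std::nullopt;
  }
}

// Verifier-style check: src/dst must match and must be lowerable.
// On failure, writes a diagnostic into `err` and returns false.
bool verifyAsyncTransferElemType(const ElemType &src, const ElemType &dst,
                                 std::string &err) {
  if (src.kind != dst.kind || src.bitWidth != dst.bitWidth) {
    err = "src and dst element types differ";
    return false;
  }
  if (!elemTypeStringOrNone(src)) {
    err = "element type is not supported by the GlobalTensor lowering";
    return false;
  }
  return true;
}
```

Under these assumptions, an i1 or i24 element type is rejected at verification time rather than reaching the lowering.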

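P1 The session produced by build_async_session does not extend the scratch buffer's lifetime to the later async use sites lib/PTO/Transforms/PTOPlanMemory.cpp:309

PlanMemory currently treats BuildAsyncSessionOp / TPutAsyncOp / TGetAsyncOp as ordinary ops that use only their direct operands, but the session result has no alias/liveness relationship with the scratch memref. Consequently, once pto.alloc_tile has been lowered to a memref, if later code only uses the session SSA value, the scratch buffer is considered dead right after build_async_session and is reused. The EmitC lowering, however, actually stores the scratch buffer inside the session, and subsequent async DMA operations work only through the session. In the default non-Level3 build, this overwrites the local scratch behind the session too early and directly breaks runtime correctness.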
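The liveness gap described above can be illustrated with a toy model. This is a sketch under stated assumptions, not PlanMemory's algorithm: `LiveOp`, `lastUseIndex`, and the `aliasOf` map are hypothetical, ops are a flat list indexed by position, and the alias map is assumed to be acyclic. It shows that without a session-to-scratch alias the scratch buffer's last use is the build op, whereas with the alias every use of the session also keeps the scratch alive.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flat op: one optional result name and its operand names.
struct LiveOp {
  std::string result;                 // empty if the op has no result
  std::vector<std::string> operands;
};

// Returns, for each value, the index of the last op that still needs it.
// `aliasOf` maps a derived value (e.g. a session) to the buffer it keeps
// alive; a use of the derived value counts as a use of that buffer too.
// Assumes the alias map has no cycles.
std::map<std::string, int>
lastUseIndex(const std::vector<LiveOp> &ops,
             const std::map<std::string, std::string> &aliasOf) {
  std::map<std::string, int> last;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    for (const std::string &operand : ops[i].operands) {
      std::string cur = operand;
      last[cur] = i;
      // Walk the alias chain so the backing buffer stays live as well.
      for (auto it = aliasOf.find(cur); it != aliasOf.end();
           it = aliasOf.find(cur)) {
        cur = it->second;
        last[cur] = i;
      }
    }
  }
  return last;
}
```

For a sequence "alloc scratch, build session from scratch, issue async put with session, wait on event and session", the scratch buffer's last use moves from the build op to the final wait once the alias `session -> scratch` is registered.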

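P2 test_async_event is modeled as a read-only query, so default CSE folds later polls into the first result lib/PTO/IR/PTO.cpp:8696

TestAsyncEventOp::getEffects() declares only reads of event/session and emits its result as an ordinary SSA value. The default pipeline still runs createCSEPass() before lowering, so two test_async_event ops on the same event/session with no intervening "write" are treated by the optimizer as equivalent expressions, and the first result is reused. The completion state of an async event changes as time passes without any IR-visible write, so this side-effect model does not hold for polling.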
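The folding behavior can be reproduced with a deliberately simplified model of common-subexpression elimination. This is a toy sketch, not MLIR's CSE implementation: `CseOp`, `toyCse`, and the single `readOnly` flag are hypothetical, and operand renaming after a replacement is omitted. Ops that only read are deduplicated by (name, operands); any op that is not read-only is never folded and invalidates everything seen so far.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical op record; `readOnly` stands in for "declares only read effects".
struct CseOp {
  std::string name;
  std::vector<std::string> operands;
  std::string result;
  bool readOnly;
};

// Returns a map from each folded result to the earlier result that replaces it.
std::map<std::string, std::string> toyCse(const std::vector<CseOp> &ops) {
  using Key = std::pair<std::string, std::vector<std::string>>;
  std::map<Key, std::string> seen;
  std::map<std::string, std::string> replaced;
  for (const CseOp &op : ops) {
    if (!op.readOnly) {
      // An op with non-read effects is kept and invalidates earlier reads.
      seen.clear();
      continue;
    }
    auto [it, inserted] = seen.emplace(Key{op.name, op.operands}, op.result);
    if (!inserted)
      replaced[op.result] = it->second; // identical read-only op: reuse result
  }
  return replaced;
}
```

In this model, two consecutive read-only polls of the same event/session collapse into one, while polls that are not flagged read-only both survive.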

reedhecre · 2026-03-31T08:07:11Z

A3 board test failed

Trigger mode: manual
Source commit: c244ded49a31
Result summary: OK 163 / FAIL 1 / SKIP 0
Log: /tmp/ptoas-board-monitor/logs/20260331_154604_manual_pr349.log
Manual command: /run a3
Triggered by: FangRui0
Trigger comment: https://github.com/zhangstevenunity/PTOAS/pull/349#issuecomment-4160603084
Failed stage: board-validation / exit=1

Failed cases

async_comm (run, exit=1)

reedhecre · 2026-03-31T08:07:13Z

A3 board test failure details: PR #349

async_comm

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_154604_manual_pr349/npu_validation/AsyncComm/async_comm/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3613456] 2026-03-31-16:06:17.598.304 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1088, there is an exception of aivec error, core id is 13, error code = 0, dump info: pc start: 0x124400000000, current: 0x1244000001ac, vec error info: 0x6000000079, mte error info: 0x7a03000120, ifu error info: 0x2fffef6987700, ccu error info: 0x1cc0001000000063, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x3000120, fixp_error1 info: 0x7a, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1466, report_stream_id=1466, task_id=0, flip_num=0, fault kernel_name=_Z17async_comm_kernelPfS_Pa, fault kernel info ext=_Z17async_comm_kernelPfS_Pa, program id=0, hash=11225892057099182609.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 16:06:18] ERROR: testcase failed (exit 1): async_comm

FangRui0 · 2026-03-31T08:59:49Z

/run a3

FangRui0 · 2026-03-31T09:15:28Z

/run a3

FangRui0 · 2026-03-31T09:15:58Z

/run a3

reedhecre · 2026-03-31T09:20:12Z

A3 board test failed

Trigger mode: manual
Source commit: c244ded49a31
Result summary: OK 161 / FAIL 3 / SKIP 0
Log: /tmp/ptoas-board-monitor/logs/20260331_170105_manual_pr349.log
Manual command: /run a3
Triggered by: FangRui0
Trigger comment: https://github.com/zhangstevenunity/PTOAS/pull/349#issuecomment-4161033055
Failed stage: board-validation / exit=1

Failed cases

lrelu (run, exit=1)
log (run, exit=1)
async_comm (run, exit=1)

reedhecre · 2026-03-31T09:20:13Z

A3 board test failure details: PR #349

lrelu

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_170105_manual_pr349/npu_validation/Lrelu/lrelu/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 267541] 2026-03-31-17:16:28.480.789 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1090, there is an exception of aivec error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x124400000094, vec error info: 0xf01d, mte error info: 0x79030001b0, ifu error info: 0x20000fc567040, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1466, report_stream_id=1466, task_id=0, flip_num=0, fault kernel_name=_Z24vec_add_scalar_kernel_2dPfS_, fault kernel info ext=_Z24vec_add_scalar_kernel_2dPfS_, program id=0, hash=1998685396591677942.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:16:29] ERROR: testcase failed (exit 1): lrelu

log

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_170105_manual_pr349/npu_validation/Log/log/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 268207] 2026-03-31-17:16:32.315.455 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1091, there is an exception of aivec error, core id is 1, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x12440000008c, vec error info: 0xf01b, mte error info: 0x7a03002e20, ifu error info: 0x212c2ff56f440, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3002e20, fixp_error1 info: 0x7a, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1488, report_stream_id=1488, task_id=0, flip_num=0, fault kernel_name=_Z17vec_log_kernel_2dPfS_, fault kernel info ext=_Z17vec_log_kernel_2dPfS_, program id=0, hash=4177555151498116596.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:16:34] ERROR: testcase failed (exit 1): log

async_comm

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_170105_manual_pr349/npu_validation/AsyncComm/async_comm/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 307289] 2026-03-31-17:19:26.076.619 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1092, there is an exception of aivec error, core id is 8, error code = 0, dump info: pc start: 0x124400000000, current: 0x1244000001ac, vec error info: 0xde24, mte error info: 0x7a03001820, ifu error info: 0x200010180d080, ccu error info: 0xcc2000000000063, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x3001820, fixp_error1 info: 0x7a, fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1492, report_stream_id=1492, task_id=0, flip_num=0, fault kernel_name=_Z17async_comm_kernelPfS_Pa, fault kernel info ext=_Z17async_comm_kernelPfS_Pa, program id=0, hash=11225892057099182609.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:19:27] ERROR: testcase failed (exit 1): async_comm

FangRui0 · 2026-03-31T09:34:09Z

/run a3

reedhecre · 2026-03-31T10:01:23Z

A3 board test failed

Trigger mode: manual
Source commit: a3df729a41ee
Result summary: OK 164 / FAIL 1 / SKIP 1
Log: /tmp/ptoas-board-monitor/logs/20260331_174121_manual_pr349.log
Manual command: /run a3
Triggered by: FangRui0
Trigger comment: https://github.com/zhangstevenunity/PTOAS/pull/349#issuecomment-4161124517
Failed stage: board-validation / exit=1

Failed cases

rowsum (run, exit=1)

reedhecre · 2026-03-31T10:01:25Z

A3 board test failure details: PR #349

rowsum

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_174121_manual_pr349/npu_validation/Rowsum/rowsum/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 718387] 2026-03-31-17:51:08.926.475 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1095, there is an exception of aivec error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000bc, vec error info: 0xe020, mte error info: 0x79030001b0, ifu error info: 0x20000fc567040, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1468, report_stream_id=1468, task_id=0, flip_num=0, fault kernel_name=_Z16rowsum_kernel_2dPfS_, fault kernel info ext=_Z16rowsum_kernel_2dPfS_, program id=0, hash=2853499709610389473.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:51:11] ERROR: testcase failed (exit 1): rowsum

reedhecre · 2026-03-31T10:21:35Z

A3 board test failed

Trigger mode: manual
Source commit: a3df729a41ee
Result summary: OK 161 / FAIL 4 / SKIP 1
Log: /tmp/ptoas-board-monitor/logs/20260331_180205_manual_pr349.log
Manual command: /run a3
Triggered by: FangRui0
Trigger comment: https://github.com/zhangstevenunity/PTOAS/pull/349#issuecomment-4161127466
Failed stage: board-validation / exit=1

Failed cases

test_inject_sync_two_event_id (run, exit=2)
shls (run, exit=1)
partmax (run, exit=1)
ors (run, exit=1)

reedhecre · 2026-03-31T10:21:36Z

A3 board test failure details: PR #349

test_inject_sync_two_event_id

stage=run info=exit=2

[ERROR] Mismatch: golden_v4.bin vs v4.bin, max diff=8.21484375 at idx=241 (golden=-6.484375, out=1.73046875, dtype=float16)
[ERROR] compare failed
[2026-03-31 18:08:08] ERROR: testcase failed (exit 2): test_inject_sync_two_event_id

shls

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_180205_manual_pr349/npu_validation/Shls/shls/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 960222] 2026-03-31-18:10:43.495.321 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1096, there is an exception of aivec error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x124400000090, vec error info: 0xf01c, mte error info: 0x79030001b0, ifu error info: 0x20000fc567040, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1485, report_stream_id=1485, task_id=0, flip_num=0, fault kernel_name=_Z18vec_shls_kernel_2dPiS_, fault kernel info ext=_Z18vec_shls_kernel_2dPiS_, program id=0, hash=5364862739528093252.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:10:44] ERROR: testcase failed (exit 1): shls

partmax

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_180205_manual_pr349/npu_validation/Partmax/partmax/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1031709] 2026-03-31-18:15:28.900.641 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1097, there is an exception of aivec error, core id is 1, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000b8, vec error info: 0xbf22, mte error info: 0x7a03002e20, ifu error info: 0x212c2ff56f440, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3002e20, fixp_error1 info: 0x7a, fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1483, report_stream_id=1483, task_id=0, flip_num=0, fault kernel_name=_Z17partmax_kernel_2dPfS_S_, fault kernel info ext=_Z17partmax_kernel_2dPfS_S_, program id=0, hash=12054057414982499609.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:15:31] ERROR: testcase failed (exit 1): partmax

ors

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_180205_manual_pr349/npu_validation/Ors/ors/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1041997] 2026-03-31-18:16:00.366.124 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1098, there is an exception of aivec error, core id is 9, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000c8, vec error info: 0xf81c, mte error info: 0x2606000065, ifu error info: 0x212e98300b300, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x6000065, fixp_error1 info: 0x26, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1486, report_stream_id=1486, task_id=0, flip_num=0, fault kernel_name=_Z13ors_kernel_2dPsS_, fault kernel info ext=_Z13ors_kernel_2dPsS_, program id=0, hash=9316633765227175002.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:16:01] ERROR: testcase failed (exit 1): ors

reedhecre · 2026-03-31T10:41:35Z

A3 board test failed

Trigger mode: manual
Source commit: a3df729a41ee
Result summary: OK 163 / FAIL 2 / SKIP 1
Log: /tmp/ptoas-board-monitor/logs/20260331_182204_manual_pr349.log
Manual command: /run a3
Triggered by: FangRui0
Trigger comment: https://github.com/zhangstevenunity/PTOAS/pull/349#issuecomment-4161239329
Failed stage: board-validation / exit=1

Failed cases

recip (run, exit=1)
paged_attention_example_kernel_softmax_prepare (run, exit=1)

reedhecre · 2026-03-31T10:41:36Z

A3 board test failure details: PR #349

recip

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_182204_manual_pr349/npu_validation/Recip/recip/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1284903] 2026-03-31-18:32:43.239.216 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1099, there is an exception of aivec error, core id is 48, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000ac, vec error info: 0xf01d, mte error info: 0x79030001b0, ifu error info: 0x212c2ff56f440, ccu error info: 0xcc2010014800053, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1461, report_stream_id=1461, task_id=0, flip_num=0, fault kernel_name=_Z15recip_kernel_2dPfS_, fault kernel info ext=_Z15recip_kernel_2dPfS_, program id=0, hash=17972813646155970315.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:32:45] ERROR: testcase failed (exit 1): recip

paged_attention_example_kernel_softmax_prepare

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_182204_manual_pr349/npu_validation/PyPTOIRParser/paged_attention_example_kernel_softmax_prepare/main.cpp:108)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1290565] 2026-03-31-18:33:06.501.975 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1100, there is an exception of aivec error, core id is 7, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000003f0, vec error info: 0xe01e, mte error info: 0x7a03000320, ifu error info: 0x20000fc567040, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3000320, fixp_error1 info: 0x7a, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1468, report_stream_id=1468, task_id=0, flip_num=0, fault kernel_name=_Z22kernel_softmax_preparePffPu6__bf16S_S_, fault kernel info ext=_Z22kernel_softmax_preparePffPu6__bf16S_S_, program id=0, hash=16053642148614005971.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:33:07] ERROR: testcase failed (exit 1): paged_attention_example_kernel_softmax_prepare

FangRui0 force-pushed the add_op branch 8 times, most recently from f396b1e to a520648 Compare March 31, 2026 07:35

FangRui0 force-pushed the add_op branch from a520648 to 1fd7690 Compare March 31, 2026 09:11

feat: Add async comm ops and emitc lowering

73a30ba

FangRui0 force-pushed the add_op branch from 1fd7690 to 73a30ba Compare March 31, 2026 09:33

jiashu added this to pto project Apr 1, 2026

github-project-automation bot moved this to Todo in pto project Apr 1, 2026
