feat: Add async comm ops and emitc lowering#349

Open
FangRui0 wants to merge 1 commit into hw-native-sys:main from FangRui0:add_op

Conversation

@FangRui0
Contributor

No description provided.

@FangRui0 force-pushed the add_op branch 8 times, most recently from f396b1e to a520648 on March 31, 2026 07:35
@FangRui0
Contributor Author

/run a3

@reedhecre

reedhecre commented Mar 31, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: feat: Add async comm ops and emitc lowering #349
  • Author: FangRui0
  • Base/Head: main / add_op
  • Head SHA: 73a30babb3b3
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-03-31T09:52:50Z
  • Previous Head SHA: a520648f7ff3
  • Status: completed

Summary

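Found 2 P1 issues and 1 P2 issue: async transfers lack an element-type constraint and are silently mis-lowered to float; plan-memory does not keep the session's scratch buffer alive; and the side-effect modeling of test_async_event lets CSE fold repeated polls.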

Findings

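  1. P1 Async transfers accept unsupported element types and silently lower them to float lib/PTO/IR/PTO.cpp:8529

The verifier here only checks that the src/dst element types match and that the shapes agree; it does not require the element type to be one that the GlobalTensor lowering supports. The later EmitC lowering calls getElemTypeStringForGT(), which falls back to "float" for any unsupported type. As a result, inputs such as memref<128xi1> or memref<128xi24> pass verification but end up as GlobalTensor<float,...>, so both the transferred byte count and the element semantics are wrong. This is a real miscompile.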
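One possible shape for the missing check is sketched below as a standalone C++17 model, not as the project's code. `ElemType`, `elemTypeStringStrict`, and `verifyAsyncElemTypes` are hypothetical names, and the set of supported types is illustrative only. The idea is that the type-to-string mapping reports "unsupported" instead of falling back to "float", and the verifier rejects whatever the mapping cannot express.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for an MLIR element type: a kind plus a bit width.
enum class ElemKind { Float, Int };
struct ElemType {
  ElemKind kind;
  unsigned bitWidth;
};

// Strict variant of the type-to-string mapping: returns std::nullopt instead
// of falling back to "float". The supported set here is illustrative only.
std::optional<std::string> elemTypeStringStrict(ElemType t) {
  assert(t.bitWidth > 0 && "element type must have a nonzero bit width");
  if (t.kind == ElemKind::Float) {
    if (t.bitWidth == 32) return "float";
    if (t.bitWidth == 16) return "half";
  } else {
    if (t.bitWidth == 8) return "int8_t";
    if (t.bitWidth == 16) return "int16_t";
    if (t.bitWidth == 32) return "int32_t";
  }
  return std::nullopt;
}

// Verifier-style check: src and dst must match AND be lowerable.
// On failure, `err` receives a diagnostic message.
bool verifyAsyncElemTypes(ElemType src, ElemType dst, std::string &err) {
  if (src.kind != dst.kind || src.bitWidth != dst.bitWidth) {
    err = "src and dst element types differ";
    return false;
  }
  if (!elemTypeStringStrict(src)) {
    err = "element type is not supported by the GlobalTensor lowering";
    return false;
  }
  return true;
}
```

Under this model, an i1 or i24 element type (`{ElemKind::Int, 1}` or `{ElemKind::Int, 24}`) fails verification instead of reaching the lowering.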

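  2. P1 The session produced by build_async_session does not extend the scratch buffer's lifetime to the later async use sites lib/PTO/Transforms/PTOPlanMemory.cpp:309

PlanMemory currently treats BuildAsyncSessionOp / TPutAsyncOp / TGetAsyncOp as ordinary ops that use only their direct operands, but the session result has no alias/liveness relationship with the scratch memref. Once pto.alloc_tile has been lowered to a memref, if later code uses only the session SSA value, the scratch buffer is considered dead right after build_async_session and its memory is reused. The EmitC lowering, however, really does store scratch inside the session, and the subsequent async DMA works through the session alone. In the default non-Level3 build, the local scratch behind the session is therefore overwritten early, which directly breaks runtime correctness.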
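The liveness gap can be illustrated with a toy model, which is a sketch and not the PlanMemory implementation. `LiveOp` and `lastUse` are hypothetical names, and the IR is reduced to a straight-line list of ops. When an op is marked as capturing its operands, every later use of its result also counts as a use of the captured buffers.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy straight-line IR: each op names its result and its operands.
struct LiveOp {
  std::string result;                // "" when the op defines no value
  std::vector<std::string> operands;
  bool resultCapturesOperands;       // true if the result keeps operands alive
};

// Returns, per value, the index of the last op that needs it alive.
std::map<std::string, int> lastUse(const std::vector<LiveOp> &ops) {
  std::map<std::string, int> last;
  std::map<std::string, std::vector<std::string>> captured;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    const LiveOp &op = ops[i];
    assert(!(op.resultCapturesOperands && op.result.empty()) &&
           "a capturing op must define a result");
    for (const std::string &v : op.operands) {
      last[v] = i;
      // Using a capturing value also uses everything it captured.
      auto it = captured.find(v);
      if (it != captured.end())
        for (const std::string &buf : it->second)
          last[buf] = i;
    }
    if (op.resultCapturesOperands)
      captured[op.result] = op.operands;
  }
  return last;
}
```

For the sequence alloc, build_async_session(scratch), tput_async(session, ...), wait(event, session), the model keeps scratch live through op 3 when the session is marked as capturing, and only through op 1 when it is not. The second case corresponds to the premature reuse described above.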

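  3. P2 test_async_event is modeled as a read-only query, so default CSE folds later polls into the first result lib/PTO/IR/PTO.cpp:8696

TestAsyncEventOp::getEffects() declares only reads of event/session and emits its result as an ordinary SSA value. The default pipeline still runs createCSEPass() before lowering, so two test_async_event ops on the same event/session with no intervening "write" are treated by the optimizer as equivalent expressions, and the first result is reused. The completion state of an async event changes as time passes, without any IR-visible write, so this side-effect model does not hold for polling.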
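The folding behavior can be reproduced with a deliberately simplified CSE model. This is a sketch, not MLIR's pass: `CseOp`, `Effect`, and `foldedByToyCSE` are hypothetical names. Read-only ops with identical names and operands are folded unless a write sits between them, and ops that declare a write are never folded.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Effect { Read, Write };

struct CseOp {
  std::string name;
  std::vector<std::string> operands;
  Effect effect;
};

// Returns the indices of ops that the toy CSE would replace with an earlier,
// identical read-only op.
std::vector<int> foldedByToyCSE(const std::vector<CseOp> &ops) {
  std::map<std::pair<std::string, std::vector<std::string>>, int> seen;
  std::vector<int> folded;
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    const CseOp &op = ops[i];
    assert(!op.name.empty() && "ops must be named");
    if (op.effect == Effect::Write) {
      seen.clear(); // a write invalidates every earlier read
      continue;     // ops with a write effect are never folded
    }
    bool isNew = seen.emplace(std::make_pair(op.name, op.operands), i).second;
    if (!isNew)
      folded.push_back(i);
  }
  return folded;
}
```

With two read-only test_async_event polls on the same event/session, the second poll is folded. If the poll is modeled with a write effect instead, nothing is folded, which is one possible direction for a fix. Whether a write effect is the right modeling for the real op is left to the PR author.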

@reedhecre

A3 board test failed

Failed cases

  • async_comm (run, exit=1)

@reedhecre

A3 board test failure details: PR #349

async_comm

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_154604_manual_pr349/npu_validation/AsyncComm/async_comm/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3613456] 2026-03-31-16:06:17.598.304 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1088, there is an exception of aivec error, core id is 13, error code = 0, dump info: pc start: 0x124400000000, current: 0x1244000001ac, vec error info: 0x6000000079, mte error info: 0x7a03000120, ifu error info: 0x2fffef6987700, ccu error info: 0x1cc0001000000063, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x3000120, fixp_error1 info: 0x7a, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1466, report_stream_id=1466, task_id=0, flip_num=0, fault kernel_name=_Z17async_comm_kernelPfS_Pa, fault kernel info ext=_Z17async_comm_kernelPfS_Pa, program id=0, hash=11225892057099182609.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 16:06:18] ERROR: testcase failed (exit 1): async_comm
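The `fault kernel_name` field in the log above is an Itanium-ABI mangled symbol. As a triage aid, a small helper along the following lines can turn it back into a readable signature. This is a sketch that assumes a GCC or Clang toolchain, where `abi::__cxa_demangle` from `<cxxabi.h>` is available; `demangle` is a hypothetical helper name.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Demangles an Itanium-ABI symbol; returns the input unchanged on failure.
std::string demangle(const std::string &mangled) {
  int status = 0;
  // __cxa_demangle returns a malloc'd buffer, so release it with std::free.
  std::unique_ptr<char, decltype(&std::free)> text(
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
      &std::free);
  return (status == 0 && text) ? std::string(text.get()) : mangled;
}
```

Calling `demangle("_Z17async_comm_kernelPfS_Pa")` yields `async_comm_kernel(float*, float*, signed char*)`, which shows that the faulting kernel takes two float pointers and one signed char pointer.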

@FangRui0
Contributor Author

/run a3

@FangRui0
Contributor Author

/run a3

@FangRui0
Contributor Author

/run a3

@reedhecre

A3 board test failed

Failed cases

  • lrelu (run, exit=1)
  • log (run, exit=1)
  • async_comm (run, exit=1)

@reedhecre

A3 board test failure details: PR #349

lrelu

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_170105_manual_pr349/npu_validation/Lrelu/lrelu/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 267541] 2026-03-31-17:16:28.480.789 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1090, there is an exception of aivec error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x124400000094, vec error info: 0xf01d, mte error info: 0x79030001b0, ifu error info: 0x20000fc567040, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1466, report_stream_id=1466, task_id=0, flip_num=0, fault kernel_name=_Z24vec_add_scalar_kernel_2dPfS_, fault kernel info ext=_Z24vec_add_scalar_kernel_2dPfS_, program id=0, hash=1998685396591677942.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:16:29] ERROR: testcase failed (exit 1): lrelu

log

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_170105_manual_pr349/npu_validation/Log/log/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 268207] 2026-03-31-17:16:32.315.455 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1091, there is an exception of aivec error, core id is 1, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x12440000008c, vec error info: 0xf01b, mte error info: 0x7a03002e20, ifu error info: 0x212c2ff56f440, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3002e20, fixp_error1 info: 0x7a, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1488, report_stream_id=1488, task_id=0, flip_num=0, fault kernel_name=_Z17vec_log_kernel_2dPfS_, fault kernel info ext=_Z17vec_log_kernel_2dPfS_, program id=0, hash=4177555151498116596.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:16:34] ERROR: testcase failed (exit 1): log

async_comm

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_170105_manual_pr349/npu_validation/AsyncComm/async_comm/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 307289] 2026-03-31-17:19:26.076.619 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1092, there is an exception of aivec error, core id is 8, error code = 0, dump info: pc start: 0x124400000000, current: 0x1244000001ac, vec error info: 0xde24, mte error info: 0x7a03001820, ifu error info: 0x200010180d080, ccu error info: 0xcc2000000000063, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x3001820, fixp_error1 info: 0x7a, fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1492, report_stream_id=1492, task_id=0, flip_num=0, fault kernel_name=_Z17async_comm_kernelPfS_Pa, fault kernel info ext=_Z17async_comm_kernelPfS_Pa, program id=0, hash=11225892057099182609.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:19:27] ERROR: testcase failed (exit 1): async_comm
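The failures in this run carry different `errorStr` values: async_comm reports "The MPU address access is invalid." while lrelu and log report "VEC instruction error: the ub address out of bounds." A small extractor such as the sketch below can pull those fields out of a log line so that failures can be grouped by signature. `extractField` is a hypothetical helper, and the key and terminator strings are taken from the log text above.

```cpp
#include <cstddef>
#include <string>

// Returns the text between `key` and the next `terminator`.
// Returns "" if `key` is missing; runs to end of line if `terminator` is missing.
std::string extractField(const std::string &line, const std::string &key,
                         const std::string &terminator) {
  std::size_t begin = line.find(key);
  if (begin == std::string::npos)
    return "";
  begin += key.size();
  std::size_t end = line.find(terminator, begin);
  if (end == std::string::npos)
    return line.substr(begin);
  return line.substr(begin, end - begin);
}
```

For example, `extractField(line, "errorStr: ", " fixp_error0")` returns the error description, and `extractField(line, "fault kernel_name=", ",")` returns the mangled kernel symbol.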

@FangRui0
Contributor Author

/run a3

@reedhecre

A3 board test failed

Failed cases

  • rowsum (run, exit=1)

@reedhecre

A3 board test failure details: PR #349

rowsum

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_174121_manual_pr349/npu_validation/Rowsum/rowsum/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 718387] 2026-03-31-17:51:08.926.475 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1095, there is an exception of aivec error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000bc, vec error info: 0xe020, mte error info: 0x79030001b0, ifu error info: 0x20000fc567040, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1468, report_stream_id=1468, task_id=0, flip_num=0, fault kernel_name=_Z16rowsum_kernel_2dPfS_, fault kernel info ext=_Z16rowsum_kernel_2dPfS_, program id=0, hash=2853499709610389473.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 17:51:11] ERROR: testcase failed (exit 1): rowsum

@reedhecre

A3 board test failed

Failed cases

  • test_inject_sync_two_event_id (run, exit=2)
  • shls (run, exit=1)
  • partmax (run, exit=1)
  • ors (run, exit=1)

@reedhecre

A3 board test failure details: PR #349

test_inject_sync_two_event_id

stage=run info=exit=2

[ERROR] Mismatch: golden_v4.bin vs v4.bin, max diff=8.21484375 at idx=241 (golden=-6.484375, out=1.73046875, dtype=float16)
[ERROR] compare failed
[2026-03-31 18:08:08] ERROR: testcase failed (exit 2): test_inject_sync_two_event_id

shls

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_180205_manual_pr349/npu_validation/Shls/shls/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 960222] 2026-03-31-18:10:43.495.321 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1096, there is an exception of aivec error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x124400000090, vec error info: 0xf01c, mte error info: 0x79030001b0, ifu error info: 0x20000fc567040, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1485, report_stream_id=1485, task_id=0, flip_num=0, fault kernel_name=_Z18vec_shls_kernel_2dPiS_, fault kernel info ext=_Z18vec_shls_kernel_2dPiS_, program id=0, hash=5364862739528093252.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:10:44] ERROR: testcase failed (exit 1): shls

partmax

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_180205_manual_pr349/npu_validation/Partmax/partmax/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1031709] 2026-03-31-18:15:28.900.641 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1097, there is an exception of aivec error, core id is 1, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000b8, vec error info: 0xbf22, mte error info: 0x7a03002e20, ifu error info: 0x212c2ff56f440, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3002e20, fixp_error1 info: 0x7a, fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1483, report_stream_id=1483, task_id=0, flip_num=0, fault kernel_name=_Z17partmax_kernel_2dPfS_S_, fault kernel info ext=_Z17partmax_kernel_2dPfS_S_, program id=0, hash=12054057414982499609.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:15:31] ERROR: testcase failed (exit 1): partmax

ors

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_180205_manual_pr349/npu_validation/Ors/ors/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1041997] 2026-03-31-18:16:00.366.124 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1098, there is an exception of aivec error, core id is 9, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000c8, vec error info: 0xf81c, mte error info: 0x2606000065, ifu error info: 0x212e98300b300, ccu error info: 0xcc200000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x6000065, fixp_error1 info: 0x26, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1486, report_stream_id=1486, task_id=0, flip_num=0, fault kernel_name=_Z13ors_kernel_2dPsS_, fault kernel info ext=_Z13ors_kernel_2dPsS_, program id=0, hash=9316633765227175002.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:16:01] ERROR: testcase failed (exit 1): ors

@reedhecre

A3 board test failed

Failed cases

  • recip (run, exit=1)
  • paged_attention_example_kernel_softmax_prepare (run, exit=1)

@reedhecre

A3 board test failure details: PR #349

recip

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_182204_manual_pr349/npu_validation/Recip/recip/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1284903] 2026-03-31-18:32:43.239.216 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1099, there is an exception of aivec error, core id is 48, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000ac, vec error info: 0xf01d, mte error info: 0x79030001b0, ifu error info: 0x212c2ff56f440, ccu error info: 0xcc2010014800053, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x30001b0, fixp_error1 info: 0x79, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1461, report_stream_id=1461, task_id=0, flip_num=0, fault kernel_name=_Z15recip_kernel_2dPfS_, fault kernel info ext=_Z15recip_kernel_2dPfS_, program id=0, hash=17972813646155970315.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:32:45] ERROR: testcase failed (exit 1): recip

paged_attention_example_kernel_softmax_prepare

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_182204_manual_pr349/npu_validation/PyPTOIRParser/paged_attention_example_kernel_softmax_prepare/main.cpp:108)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1290565] 2026-03-31-18:33:06.501.975 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1100, there is an exception of aivec error, core id is 7, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000003f0, vec error info: 0xe01e, mte error info: 0x7a03000320, ifu error info: 0x20000fc567040, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3000320, fixp_error1 info: 0x7a, fsmId:1, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1468, report_stream_id=1468, task_id=0, flip_num=0, fault kernel_name=_Z22kernel_softmax_preparePffPu6__bf16S_S_, fault kernel info ext=_Z22kernel_softmax_preparePffPu6__bf16S_S_, program id=0, hash=16053642148614005971.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 18:33:07] ERROR: testcase failed (exit 1): paged_attention_example_kernel_softmax_prepare
