
feat: add compact mode to tile config#400

Open
FangRui0 wants to merge 1 commit into hw-native-sys:main from FangRui0:add_compact
Conversation

@FangRui0
Contributor

No description provided.

@FangRui0
Contributor Author

/run a3

@reedhecre

A3 board test failed

Failed cases

  • rowexpandadd (run, exit=1)
  • reshape (run, exit=1)
  • abs (run, exit=1)

@reedhecre

A3 board test failure details: PR #400

rowexpandadd

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_111604_manual_pr400/npu_validation/Rowexpandadd/rowexpandadd/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 920321] 2026-03-31-11:26:14.987.234 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1081, there is an exception of aivec error, core id is 26, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000e0, vec error info: 0x2c, mte error info: 0x710300a846, ifu error info: 0x200003db50b00, ccu error info: 0xcc201000d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x300a846, fixp_error1 info: 0x71, fsmId:0, tslot:2, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1471, report_stream_id=1471, task_id=0, flip_num=0, fault kernel_name=_Z23trowexpandadd_kernel_2dPfS_S_, fault kernel info ext=_Z23trowexpandadd_kernel_2dPfS_S_, program id=0, hash=9985848676763991650.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 11:26:17] ERROR: testcase failed (exit 1): rowexpandadd
reshape

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor/runs/20260331_111604_manual_pr400/npu_validation/Reshape/reshape/main.cpp:75)
[ERROR] RecentErrMsg: [PID: 1211395] 2026-03-31-11:26:28.107.102 Memory_Allocation_Failure(EL0004): Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        allocate device memory failed.[FUNC:Init][FILE:memory_pool_manager.cc][LINE:50]
        Fail to init MemoryPoolManager, retCode=0x711000e.[FUNC:Init][FILE:raw_device.cc][LINE:670]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3547]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3171]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3202]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3278]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 2 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6029]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-03-31 11:26:29] ERROR: testcase failed (exit 1): reshape
abs

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_111604_manual_pr400/npu_validation/Abs/abs/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1745279] 2026-03-31-11:35:41.707.746 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1082, there is an exception of aivec error, core id is 3, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x12440000008c, vec error info: 0xf01b, mte error info: 0x7a03000020, ifu error info: 0x212c2ff56f400, ccu error info: 0xcc2000029800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3000020, fixp_error1 info: 0x7a, fsmId:1, tslot:4, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1470, report_stream_id=1470, task_id=0, flip_num=0, fault kernel_name=_Z13abs_kernel_2dPfS_, fault kernel info ext=_Z13abs_kernel_2dPfS_, program id=0, hash=8649095210733992711.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 11:35:43] ERROR: testcase failed (exit 1): abs
[2026-03-31 11:35:43] === SUMMARY ===
[2026-03-31 11:35:43] OK=160 FAIL=3 SKIP=0
[2026-03-31 11:35:43] RESULTS_TSV=/tmp/ptoas-board-monitor/runs/20260331_111604_manual_pr400/remote_npu_validation_results.tsv

@reedhecre

reedhecre commented Mar 31, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: feat: add compact mode to tile config #400
  • Author: FangRui0
  • Base/Head: main / add_compact
  • Head SHA: f7801ff330ae
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-03-31T13:28:36Z
  • Previous Head SHA: 29b105044d6a
  • Status: completed

Summary

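PR #400 has 2 issues: the new CompactMode is not wired into the existing C++ compile-level regression coverage, and the text syntax described in the documentation does not match what the parser actually accepts.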

Findings

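  1. P2 The `compile_cpp` smoke test still compiles the old `Tile<>` form, so this change has no end-to-end compile coverage test/compile_cpp/abs_vec_core.cpp:35

In lib/PTO/Transforms/PTOToEmitC.cpp, this PR changes every generated tile type to Tile<..., PadValue::X, CompactMode::Y>, but the repository's only compile-level smoke sample, test/compile_cpp/abs_vec_core.cpp, still uses the old 10-parameter Tile<..., PadValue::Null> form. test/compile_cpp/README.md:1 also states explicitly that this file is generated by ptoas, so it no longer represents the current compiler output. As a result, even if the new EmitC output is incompatible with pto/common/pto_instr.hpp or the external toolchain, the in-repo compile script will not catch the regression. For a pure code-generation interface change like this, that is a significant CI/compatibility gap.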
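As a rough illustration of the compile-level check this finding asks for, the sketch below instantiates both the old and the new spelling of a tile type and verifies them at compile time. It uses a simplified stand-in `Tile` template, not the real one from pto/common/pto_instr.hpp: the parameter list is shortened, the enumerator names `Zero` and `Normal` are assumptions, and whether the real header gives `CompactMode` a default value is not stated in this PR.

```cpp
#include <type_traits>

// Stand-in enums. Only PadValue::Null and the CompactMode name itself
// appear in this PR; the remaining enumerators are assumed.
enum class PadValue { Null, Zero };
enum class CompactMode { Null, Normal };

// Simplified stand-in for Tile<>: the real template takes more parameters.
// The trailing CompactMode parameter is defaulted here (an assumption) so
// that the old spelling without it still names a valid type.
template <typename T, int Rows, int Cols,
          PadValue Pad = PadValue::Null,
          CompactMode Compact = CompactMode::Null>
struct Tile {
  static constexpr PadValue pad = Pad;
  static constexpr CompactMode compact = Compact;
};

// Old form (no CompactMode) and new form must both instantiate.
using OldForm = Tile<float, 16, 16, PadValue::Null>;
using NewForm = Tile<float, 16, 16, PadValue::Null, CompactMode::Normal>;

// With a defaulted parameter, the old spelling is the same type as the
// new spelling that passes the default explicitly.
static_assert(std::is_same_v<
                  OldForm,
                  Tile<float, 16, 16, PadValue::Null, CompactMode::Null>>,
              "old form should resolve to the default compact mode");
static_assert(NewForm::compact == CompactMode::Normal,
              "new form should carry the requested compact mode");
```

If the real header does not default the new parameter, the old-form alias would stop compiling, which is exactly the kind of break a regenerated smoke sample would surface.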

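  2. P3 The new `compact` text syntax accepts only integers, which does not match the enum/example syntax advertised in the PR documentation lib/PTO/IR/PTOTypeDefs.cpp:223

TileBufType::parse calls parseInteger directly for compact here, so an enum spelling consistent with blayout/slayout, such as !pto.tile_buf<..., compact=normal>, fails to parse. Meanwhile, docs/PTO_IR_manual.md:204-206 claims that compact supports mnemonics and gives the example #pto.tile_buf_config<row_major, none_box, 16, zero, null>; but TileBufConfigAttr::parse in lib/PTO/IR/PTOAttrs.cpp still accepts only the keyed form with full attributes or integer attributes, so that example cannot be parsed either. In other words, a user who follows the PR's new documentation to use compact mode will immediately hit a parser error.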
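One possible shape for a fix is to accept either a mnemonic or an integer for `compact`. The sketch below shows that idea in standalone C++ rather than with the MLIR parser API, so it is not the code in lib/PTO/IR/PTOTypeDefs.cpp. The function name `parseCompact` is hypothetical, the mnemonics `null` and `normal` are taken from the examples quoted above, and the numeric values assigned to them are assumptions.

```cpp
#include <cctype>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical enum: the numeric values are assumed for illustration.
enum class CompactMode : std::int32_t { Null = 0, Normal = 1 };

// Parse the value of `compact=` from either a mnemonic ("normal")
// or a non-negative decimal integer ("1"). Returns std::nullopt for
// anything else, including out-of-range integers.
std::optional<CompactMode> parseCompact(std::string_view text) {
  // Keyword spellings, in the same style as blayout/slayout.
  if (text == "null") return CompactMode::Null;
  if (text == "normal") return CompactMode::Normal;

  // Fall back to the integer spelling that the current parser accepts.
  if (text.empty()) return std::nullopt;
  std::int32_t value = 0;
  for (char c : text) {
    if (!std::isdigit(static_cast<unsigned char>(c))) return std::nullopt;
    value = value * 10 + (c - '0');
    // Reject as soon as the value leaves the known enum range.
    if (value > static_cast<std::int32_t>(CompactMode::Normal))
      return std::nullopt;
  }
  return static_cast<CompactMode>(value);
}
```

With this behavior, `parseCompact("normal")` and `parseCompact("1")` yield the same mode, while an unknown word such as `parseCompact("bogus")` yields an empty optional that the caller can turn into a diagnostic.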

@FangRui0
Contributor Author

/run a3

@reedhecre

A3 board test failed

Failed cases

  • mul (run, exit=1)

@reedhecre

A3 board test failure details: PR #400

mul

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor/runs/20260331_164104_manual_pr400/npu_validation/Mul/mul/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 4127433] 2026-03-31-16:55:12.811.284 (EZ9999):  The error from device(chipId:1, dieId:0), serial number is 1089, there is an exception of aivec error, core id is 41, error code = 0x4000000000000000, dump info: pc start: 0x124400000000, current: 0x1244000000b8, vec error info: 0xbf22, mte error info: 0x7a03000920, ifu error info: 0x212c0812000c0, ccu error info: 0x1cc600180d80000f, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:389]
        TraceBack (most recent call last):
       The extend info: errcode:(0x4000000000000000, 0, 0) errorStr: VEC instruction error: the ub address out of bounds. fixp_error0 info: 0x3000920, fixp_error1 info: 0x7a, fsmId:1, tslot:5, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:402]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1497]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1346]
       [DFX_INFO]Aicore kernel execute failed, device_id=2, stream_id=1469, report_stream_id=1469, task_id=0, flip_num=0, fault kernel_name=_Z13mul_kernel_2dPfS_S_, fault kernel info ext=_Z13mul_kernel_2dPfS_S_, program id=0, hash=16858064220790693416.[FUNC:GetError][FILE:stream.cc][LINE:1346]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-03-31 16:55:14] ERROR: testcase failed (exit 1): mul

@FangRui0
Contributor Author

/run a3

@reedhecre

A3 board test passed

  • Trigger type: manual
  • Source commit: 12f55ce4e7e1
  • Result summary: OK 163 / FAIL 0 / SKIP 0
  • Log: /tmp/ptoas-board-monitor/logs/20260331_184205_manual_pr400.log
  • Results TSV: /tmp/ptoas-board-monitor/logs/20260331_184205_manual_pr400.tsv
  • Manual command: /run a3
  • Triggered by: FangRui0
  • Trigger comment: https://github.com/zhangstevenunity/PTOAS/pull/400#issuecomment-4161563812



3 participants