Skip to content

[codex] fix Cube kernel hang in generated C++#443

Open
zhangstevenunity wants to merge 1 commit intomainfrom
codex/fix-issue-428
Open

[codex] fix Cube kernel hang in generated C++#443
zhangstevenunity wants to merge 1 commit intomainfrom
codex/fix-issue-428

Conversation

@zhangstevenunity
Copy link
Copy Markdown
Collaborator

Summary

Root cause

WidenEventId was reallocating whole scope pairs even when they contained loop-carried/back-edge sync operations. That rewrite could drop the pre-loop set_flag / tail drain pairing for PIPE_M -> PIPE_MTE1, which left the generated Cube kernel with loop-head waits that were never primed.

Validation

  • build ptoas with -DPTO_ENABLE_PYTHON_BINDING=OFF -DBUILD_TESTING=OFF
  • run FileCheck on test/basic/issue428_cube_sync_regression.pto
  • run FileCheck on test/basic/insert_sync_level3_enable.pto
  • run FileCheck on test/basic/tinsert_a3_pipe_selection.pto
  • run FileCheck on test/basic/tinsert_a5_pipe_selection.pto
  • run FileCheck on test/basic/tmov_acc_mat_pipe_selection.pto
  • run FileCheck on test/basic/tpush_tpop_frontend_lowering_a3.pto with --check-prefix=SYNC-A3

Closes #428

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request prevents the reallocation of scope pairs containing loop-carried syncs during the event ID widening phase, ensuring that necessary head/tail schedules for loops remain intact. It also adds a regression test for issue #428. Feedback suggests optimizing the complexity of the loop-carried sync check by pre-calculating the affected scope pairs to avoid potential performance overhead in large kernels.

Comment on lines +493 to +513
bool SyncEventIdAllocation::scopePairHasLoopCarriedSync(int scopePair) const {
for (auto &element : syncIR_) {
for (auto *sync : element->pipeBefore) {
if (!sync || sync->uselessSync)
continue;
if (!sync->GetForEndIndex().has_value())
continue;
if (ScopePair(sync) == scopePair)
return true;
}
for (auto *sync : element->pipeAfter) {
if (!sync || sync->uselessSync)
continue;
if (!sync->GetForEndIndex().has_value())
continue;
if (ScopePair(sync) == scopePair)
return true;
}
}
return false;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The scopePairHasLoopCarriedSync function iterates over the entire syncIR_ structure, resulting in $O(N)$ complexity. Since this function is called within WidenEventId, which is itself invoked in a loop over syncIR_ in Allocate, the overall complexity of the allocation pass becomes $O(N^2)$. For large kernels with many operations, this could lead to significant compilation overhead.

Consider pre-calculating the set of scope pairs that contain loop-carried syncs once at the beginning of the Allocate method (after the initial allocation step) and storing it in a local set or a member variable to allow $O(1)$ lookups during the widening phase.

@zhangstevenunity zhangstevenunity marked this pull request as ready for review April 4, 2026 12:27
@zhangstevenunity
Copy link
Copy Markdown
Collaborator Author

/run a5 test/basic/issue428_cube_sync_regression.pto

@reedhecre
Copy link
Copy Markdown

A5 板测失败

  • 触发方式:manual
  • 源码提交:c48bd1a067eb
  • 结果汇总:OK 0 / FAIL 1 / SKIP 0
  • 日志:/root/ptoas-board-monitor-a5/logs/20260404_202813_manual_pr443.log
  • 手动指令:/run a5 test/basic/issue428_cube_sync_regression.pto
  • 触发人:zhangstevenunity
  • 指定用例:test/basic/issue428_cube_sync_regression.pto
  • 触发评论:[codex] fix Cube kernel hang in generated C++ #443 (comment)
  • 失败阶段:board-validation / exit=1

失败用例

  • issue428_cube_sync_regression (run, exit=2)

@reedhecre
Copy link
Copy Markdown

A5 板测失败详情:PR #443

issue428_cube_sync_regression

stage=run info=exit=2

/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:145:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_MTE1, EVENT_ID2);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:145:20: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_MTE1, EVENT_ID2);
                   ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:146:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_MTE1, EVENT_ID3);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:146:20: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_MTE1, EVENT_ID3);
                   ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:147:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_FIX, PIPE_M, EVENT_ID2);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:147:22: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_FIX, PIPE_M, EVENT_ID2);
                     ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:148:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_FIX, PIPE_M, EVENT_ID6);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:148:22: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_FIX, PIPE_M, EVENT_ID6);
                     ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:150:23: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
                      ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:151:24: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  wait_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
                       ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:154:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_MTE1, PIPE_M, EVENT_ID0);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:154:23: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_MTE1, PIPE_M, EVENT_ID0);
                      ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:155:13: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  wait_flag(PIPE_MTE1, PIPE_M, EVENT_ID0);
            ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:155:24: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  wait_flag(PIPE_MTE1, PIPE_M, EVENT_ID0);
                       ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:157:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_FIX, EVENT_ID0);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:157:20: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_FIX, EVENT_ID0);
                   ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:158:12: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_MTE1, EVENT_ID0);
           ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:158:20: error: the ranges of 2nd parameter must be [0, 1], [4, 5]
  set_flag(PIPE_M, PIPE_MTE1, EVENT_ID0);
                   ^
/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/npu_validation/basic/issue428_cube_sync_regression/issue428_cube_sync_regression_kernel.cpp:159:13: error: the ranges of 1st parameter must be [0, 1], [4, 5]
  wait_flag(PIPE_M, PIPE_FIX, EVENT_ID0);
            ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
gmake[2]: *** [CMakeFiles/issue428_cube_sync_regression_kernel.dir/build.make:76: CMakeFiles/issue428_cube_sync_regression_kernel.dir/issue428_cube_sync_regression_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/issue428_cube_sync_regression_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-04-04 20:30:00] ERROR: testcase failed (exit 2): issue428_cube_sync_regression
[2026-04-04 20:30:00] === SUMMARY ===
[2026-04-04 20:30:00] OK=0 FAIL=1 SKIP=0
[2026-04-04 20:30:00] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260404_202813_manual_pr443/remote_npu_validation_results.tsv

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7198be31a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +486 to +487
if (!scopePairHasLoopCarriedSync(scopePair))
reallocatedPipePair.insert(scopePair);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reallocate non-loop syncs even when scope has loop-carried deps

This guard skips reallocatedPipePair for the entire (src,dst) scope as soon as any loop-carried sync exists in that scope, so unrelated non-loop SET_EVENT syncs in the same pipe pair can no longer use the reallocation path when they run out of IDs. In that case they fall through to ChangeNoEventIdSyncToPipeAll and are downgraded to PIPE_ALL, which serializes execution globally and can cause major performance regressions for mixed loop/non-loop kernels that previously stayed on event-based sync.

Useful? React with 👍 / 👎.

@zhangstevenunity
Copy link
Copy Markdown
Collaborator Author

/run a3 test/basic/issue428_cube_sync_regression.pto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] generated C++ Cube kernel hangs with version >= 0.18

2 participants