[InsertSync] scalar autosync + addptr alias-through (scalar-only scope) #206

Open
TaoTao-real wants to merge 6 commits into hw-native-sys:main from TaoTao-real:codex/load-store-scalar-pipe-s-sync

Conversation


@TaoTao-real TaoTao-real commented Mar 6, 2026

Summary

  • make pto.load_scalar / pto.store_scalar participate in InsertSync as PIPE_S ops
  • add pto.addptr alias-through in InsertSync translator, including constant-offset byte folding and conservative handling for dynamic offsets
  • keep scalar sync behavior aligned with AscendNPU-IR semantics:
    • skip PIPE_S -> PIPE_S local sync insertion
    • keep scalar cross-pipe dependencies as normal set_flag/wait_flag sync (no forced PIPE_ALL fallback)
  • add scalar sync regressions and runop structural checks:
    • test_inject_sync_scalar_cross_pipe.py
    • test_inject_sync_scalar_intra_pipe_barrier.py

Scope

This PR is intentionally scalar-sync only.
Layout-inference changes (e.g. rowmin/rowsum/rowmax related fixes) are explicitly excluded from this PR.

Motivation

InsertSync previously could miss synchronization around scalar load/store paths and pto.addptr alias chains, risking under-synchronization for dependencies that involve scalar memory accesses.

Design

  • LoadScalarOp / StoreScalarOp
    • add OpPipeInterface
    • return PIPE_S in getPipe()
  • InsertSync analysis
    • same-pipe PIPE_S is skipped (in-order scalar pipeline)
    • scalar cross-pipe dependencies are synchronized via set_flag/wait_flag
  • alias propagation
    • add pto.addptr to alias chain updates
    • support constant addptr offsets in byte-range analysis
    • dynamic addptr offsets use conservative alias modeling (prefer over-sync to under-sync)

Testing

Local regression:

source /Users/lishengtao/Workspace/PTO/env.sh
export PTO_PYTHON_ROOT=/Users/lishengtao/Documents/PTO/PTOAS_upstream_main_main/build-main/python
pto_export_runtime
PTOAS_BIN=/Users/lishengtao/Documents/PTO/PTOAS_pr206_scalar_only/build-main/tools/ptoas/ptoas \
  bash test/samples/runop.sh -t Sync

Result: pass (A5-only sample skipped as expected).

A3 remote board validation (latest, commit 6c8c44e):

  • Host: 101.245.68.6 (DEVICE_ID=10, SOC_VERSION=Ascend910)
  • Flow: CI-equivalent test/npu_validation/scripts/run_remote_npu_validation.sh
  • Result: OK=19, FAIL=0, SKIP=1
  • Key scalar cases:
    • test_inject_sync_scalar_cross_pipe: OK
    • test_inject_sync_scalar_intra_pipe_barrier: OK

Risk / Rollback

  • Risk: sync coverage for scalar ops expands compared with the previous behavior, in which scalar memory ops were untracked
  • Mitigation: same-pipe PIPE_S remains skipped; cross-pipe behavior follows existing set/wait model and has local + remote regression coverage
  • Rollback: revert this PR (changes are localized to InsertSync + Sync tests)

@TaoTao-real
Contributor Author

Added commit 269dedb for the long-term row-reduction layout fix and validated on A3 board before PR submission.

A3 remote validation (CI script flow):

  • Script: test/npu_validation/scripts/run_remote_npu_validation.sh
  • Env: STAGE=run, RUN_MODE=npu, SOC_VERSION=Ascend910, DEVICE_ID=2
  • Run only: rowmin,rowsum,rowmax

Result:

  • rowmin: OK
  • rowsum: OK
  • rowmax: OK
  • Summary: OK=3 FAIL=0

Extra note: additionally running tensor_view_layout_dn fails with a pto-isa static assert (DN GlobalTensor with RowMajor MatTile) on A3; that case is orthogonal to the row-reduction fix above.

@TaoTao-real TaoTao-real changed the title [InsertSync] support scalar PIPE_S autosync and addptr alias-through [WIP] [InsertSync] support scalar PIPE_S autosync and addptr alias-through Mar 6, 2026
@TaoTao-real TaoTao-real force-pushed the codex/load-store-scalar-pipe-s-sync branch from 269dedb to 40cee3b Compare March 6, 2026 08:30
@TaoTao-real TaoTao-real changed the title [WIP] [InsertSync] support scalar PIPE_S autosync and addptr alias-through [InsertSync] scalar autosync + addptr alias-through (scalar-only scope) Mar 6, 2026
@zhangstevenunity
Collaborator

@codex


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c8c44e6cf


Comment on lines +553 to +557
if (hasUnknownAliasRange) {
  // Dynamic pointer arithmetic cannot be modeled precisely here.
  // Keep root/scope aliasing, but drop concrete range info conservatively.
  newInfo->baseAddresses.clear();
  newInfo->allocateSize = 0;

P1: Keep unknown alias ranges conservative for local memory

Clearing baseAddresses and forcing allocateSize = 0 for unknown alias ranges is not conservative once that value flows into a later alias op with a known offset but unknown size (for example memref.reinterpret_cast): the later op repopulates baseAddresses but keeps allocateSize==0, and MemoryDependentAnalyzer::MemAlias then treats local-memory ranges as non-overlapping when both address lists are non-empty, which can drop real dependencies and skip required sync insertion. Please preserve an explicit “unknown range” state across descendants (or make local allocateSize==0 behave like GM unknown-size overlap).


@zhangstevenunity
Collaborator

@codex


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d668c6fde


Comment on lines +548 to +551
if (info.first == -1) {
  hasUnknownAliasRange = true;
} else {
  deltaOffset = info.first;

P1: Treat negative addptr offsets as unknown alias ranges

getStaticOffsetAndSize can return negative byte deltas for pto.addptr, but this path only treats -1 as “unknown” and otherwise applies the delta to baseAddresses (a uint64_t). For valid IR with negative element offsets (the addptr verifier does not forbid them), subtracting from a low base can underflow and wrap to a huge unsigned address, making two aliasing accesses look disjoint. In that scenario InsertSync can miss a real dependency and skip required synchronization, so negative deltas should be handled conservatively (e.g., mark unknownRange) or with checked signed range math before updating addresses.

