Perf: replace read_reg with poll_reg in COND polling loops #428

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from
chenshengxin2026:perf/barrier-free-cond-poll

Conversation

@chenshengxin2026
Contributor

Summary

  • Add poll_reg() — a barrier-free volatile read — for hot COND register polling loops in AICPU executors
  • Add poll_acquire_barrier() macro (dmb ish on ARM64, compiler barrier on x86_64) inserted once on the completion path after the awaited condition is detected
  • Replace read_reg() → poll_reg() at all COND polling sites across all 5 runtimes (a2a3: aicpu_build_graph, host_build_graph, tensormap_and_ringbuffer; a5: host_build_graph, tensormap_and_ringbuffer)

The barrier cost is now O(1) per task completion instead of O(poll iterations), eliminating dmb overhead on every iteration of the "not-yet-done" hot path.
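
As a rough illustration of the two primitives described above, here is a minimal sketch. It is not the code from this PR: the signatures of poll_reg() and the wait_for_cond() helper are assumptions made for the example, and only the barrier choice (dmb ish on ARM64, compiler barrier otherwise) follows the summary.

```cpp
#include <cassert>
#include <cstdint>

// Assumed shape of poll_acquire_barrier(): a full "dmb ish" on ARM64 and a
// compiler-only barrier on other targets such as x86_64.
#if defined(__aarch64__)
#define poll_acquire_barrier() __asm__ __volatile__("dmb ish" ::: "memory")
#else
#define poll_acquire_barrier() __asm__ __volatile__("" ::: "memory")
#endif

// Assumed shape of poll_reg(): a plain volatile load with no barrier.
// The real function may take a base address and a register id instead.
static inline uint32_t poll_reg(const volatile uint32_t* reg) {
    return *reg;
}

// Hypothetical helper: spin on the register without barriers, then issue a
// single barrier once the awaited value is seen, before the caller touches
// Normal memory. Returns false if the value never appears within max_spins.
static inline bool wait_for_cond(const volatile uint32_t* cond_reg,
                                 uint32_t done_value,
                                 uint64_t max_spins) {
    assert(cond_reg != nullptr);
    for (uint64_t i = 0; i < max_spins; ++i) {
        if (poll_reg(cond_reg) == done_value) {
            poll_acquire_barrier();  // one barrier, on the completion path only
            return true;
        }
    }
    return false;
}
```

The volatile qualifier keeps the compiler from hoisting the load out of the loop, so each iteration still reads the register; the barrier is paid only once, after the condition is observed.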

Testing

  • Simulation tests pass: a2a3sim 13/13, a5sim 2/2
  • Hardware tests: not run (no idle devices available at time of commit)


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new poll_reg function and a poll_acquire_barrier macro to optimize register polling across the a2a3 and a5 platforms. The poll_reg function allows for low-overhead volatile reads within hot loops by omitting memory barriers, while the poll_acquire_barrier ensures proper memory synchronization once the polling condition is satisfied. These primitives have been integrated into several executor components to improve performance. I have no further feedback to provide as no review comments were present.

@chenshengxin2026 chenshengxin2026 force-pushed the perf/barrier-free-cond-poll branch 7 times, most recently from c79b5a4 to 07401c3 Compare April 2, 2026 03:06
Add poll_reg() — a barrier-free volatile read — for use in hot spin-wait
loops that poll the AICore COND register. Add poll_acquire_barrier() (dmb
ish on ARM64, compiler barrier on x86_64) inserted once on the cold path
when the awaited condition is detected.

- platform (a2a3, a5): add poll_reg() declaration and implementation;
  add poll_acquire_barrier() macro to memory_barrier.h
- runtimes (host_build_graph, aicpu_build_graph, tensormap_and_ringbuffer
  on both a2a3 and a5): replace read_reg() → poll_reg() for the COND
  register reads inside the polling loop; insert poll_acquire_barrier()
  at each completion branch before accessing Normal memory

The barrier cost is now O(1) per task completion instead of O(iterations),
eliminating dmb overhead on every iteration of the "not-yet-done" hot path.
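
The O(1) versus O(iterations) claim can be checked with a small counting model. This is a sketch under stated assumptions, not a measurement: SimCond is a hypothetical simulated register that reports "done" after a fixed number of reads, and read_reg() is assumed to issue one barrier per read, as the description of the old behavior implies.

```cpp
#include <cstdint>

// Hypothetical simulated COND register: reads 0 until the given number of
// polls has been made, then reads 1 ("done").
struct SimCond {
    uint32_t polls_until_done;
    uint32_t reads;
    explicit SimCond(uint32_t n) : polls_until_done(n), reads(0) {}
    uint32_t read() {
        ++reads;
        return reads >= polls_until_done ? 1u : 0u;
    }
};

// Old pattern (assumed): every read carries a barrier, so the barrier
// count equals the number of poll iterations.
uint64_t barriers_with_read_reg(uint32_t polls_until_done) {
    SimCond cond(polls_until_done);
    uint64_t barriers = 0;
    for (;;) {
        uint32_t value = cond.read();
        ++barriers;  // barrier paid on every iteration
        if (value == 1u) break;
    }
    return barriers;
}

// New pattern: barrier-free reads in the loop, one barrier at the
// completion branch.
uint64_t barriers_with_poll_reg(uint32_t polls_until_done) {
    SimCond cond(polls_until_done);
    uint64_t barriers = 0;
    for (;;) {
        if (cond.read() == 1u) {
            ++barriers;  // single barrier once the condition is detected
            break;
        }
    }
    return barriers;
}
```

In this model a task that completes after 1000 polls costs 1000 barriers with the old pattern and one with the new pattern, regardless of how long the wait lasts.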
@chenshengxin2026 chenshengxin2026 force-pushed the perf/barrier-free-cond-poll branch from 07401c3 to 694bb60 Compare April 2, 2026 03:17