Add: require_sync_start for atomic SPMD block launch #448

Merged
ChaoWao merged 2 commits into hw-native-sys:main from poursoul:spmd-dev
Apr 3, 2026

poursoul (Collaborator) commented Apr 3, 2026

Introduce a sync_start mechanism that forces all blocks of an SPMD task to be dispatched atomically before any can begin execution.

Submission layer (pto_submit_types.h, pto_orchestrator.cpp/h):

  • Add LaunchSpec::require_sync_start and active_mask bit-3 flag
  • Add pto2_core_mask() / pto2_requires_sync_start() helpers
  • Validate block_num < total resources at submit time to prevent deadlock
  • Fix total_required_subtasks to use pto2_core_mask (strip flag bits)
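
The flag encoding and the submit-time check described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the bit layout (bits 0-2 select subtask cores, bit 3 carries the sync_start flag), the 8-bit mask width, and every name in the `sketch` namespace are assumptions chosen to mirror the bullet points.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

namespace sketch {

// ASSUMPTION: bits 0-2 of active_mask select subtask cores,
// bit 3 is the sync_start flag. The real layout may differ.
constexpr uint8_t kCoreBits = 0x07;
constexpr uint8_t kSyncStartFlag = 0x08;

// Strip flag bits so that only core-selection bits remain.
inline uint8_t core_mask(uint8_t active_mask) {
    return static_cast<uint8_t>(active_mask & kCoreBits);
}

inline bool requires_sync_start(uint8_t active_mask) {
    return (active_mask & kSyncStartFlag) != 0;
}

// Fold the launch-spec boolean into the mask at submit time.
inline uint8_t encode_active_mask(uint8_t cores, bool require_sync_start) {
    assert((cores & ~kCoreBits) == 0 && "caller must pass core bits only");
    return static_cast<uint8_t>(cores | (require_sync_start ? kSyncStartFlag : 0));
}

// Count subtasks from the core bits only; counting the raw mask would
// treat the flag bit as an extra core.
inline int total_required_subtasks(uint8_t active_mask, int block_num) {
    int per_block = static_cast<int>(std::bitset<8>(core_mask(active_mask)).count());
    return per_block * block_num;
}

// Strict check as stated in the description: block_num < total resources.
inline bool sync_start_submittable(int block_num, int total_resources) {
    return block_num > 0 && block_num < total_resources;
}

}  // namespace sketch
```

With this layout, a mask of `0x0D` (cores 0 and 2 plus the flag) yields two subtasks per block, whereas a popcount over the raw mask would report three.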

Scheduler drain protocol (aicpu_executor.cpp):

  • Three-phase drain: ack barrier → global resource check → exclusive dispatch
  • Elected thread verifies global idle resources before dispatching; if insufficient, all threads return to completion polling and retry
  • Non-elected threads spin-wait during dispatch, giving the elected thread exclusive CoreTracker access (no data race on core_states_)
  • Track active_sched_threads_ separately from thread_num_ so orchestrator threads that have not transitioned to scheduling do not block the ack barrier
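
The three phases above can be modelled as a step function that each scheduler thread calls from its main loop. The sketch below is a simplified, single-stepped model under stated assumptions: `DrainState`, `DrainStep`, `begin_drain`, and `drain_step` are hypothetical names, a plain `int` stands in for the CoreTracker, and the reset logic is not hardened for real concurrent use. A return value of `Waiting` corresponds to a thread that would spin (or keep waiting at the barrier) in the real executor.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

namespace sketch {

enum class DrainStep { Idle, Waiting, Retry, Dispatched };

struct DrainState {
    std::atomic<bool> pending{false};   // a sync_start task is waiting to launch
    std::atomic<uint64_t> ack_mask{0};  // one bit per scheduler thread that has acked
    std::atomic<int> elected{-1};       // thread that performs the dispatch
    int blocks_needed = 0;
};

inline uint64_t all_threads_mask(int n) {
    assert(n > 0 && n <= 64);
    return n == 64 ? ~uint64_t{0} : (uint64_t{1} << n) - 1;
}

inline void begin_drain(DrainState& s, int blocks_needed) {
    s.blocks_needed = blocks_needed;
    s.ack_mask.store(0);
    s.elected.store(-1);
    s.pending.store(true, std::memory_order_release);
}

// idle_cores stands in for the shared core tracker (hypothetical).
inline DrainStep drain_step(DrainState& s, int tid, int active_sched_threads,
                            int& idle_cores) {
    assert(tid >= 0 && tid < active_sched_threads);
    if (!s.pending.load(std::memory_order_acquire)) return DrainStep::Idle;

    // Phase 1: ack barrier. Only active scheduler threads are counted.
    s.ack_mask.fetch_or(uint64_t{1} << tid, std::memory_order_acq_rel);
    int expected = -1;
    s.elected.compare_exchange_strong(expected, tid);  // first caller wins
    if (s.elected.load() != tid) return DrainStep::Waiting;
    if (s.ack_mask.load(std::memory_order_acquire) !=
        all_threads_mask(active_sched_threads)) {
        return DrainStep::Waiting;  // barrier not complete yet
    }

    // Phase 2: global resource check by the elected thread.
    if (idle_cores < s.blocks_needed) {
        s.ack_mask.store(0);
        s.elected.store(-1);
        return DrainStep::Retry;  // everyone resumes completion polling
    }

    // Phase 3: exclusive dispatch; no other thread touches idle_cores here.
    idle_cores -= s.blocks_needed;
    s.ack_mask.store(0);
    s.elected.store(-1);
    s.pending.store(false, std::memory_order_release);
    return DrainStep::Dispatched;
}

}  // namespace sketch
```

Passing `active_sched_threads` rather than the total thread count reflects the last bullet: a thread still acting as orchestrator never sets its ack bit, so it must not be part of the expected mask.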

SPMD dispatch refactor:

  • Extract dispatch_block_to_cluster / dispatch_mix_block_to_cluster
  • AIV path uses count_idle_aiv_cores for accurate resource counting
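
One plausible reading of "accurate resource counting" is that an AIV-only block needs a single idle AIV core rather than a fully idle cluster, so counting idle clusters would undercount capacity. The sketch below illustrates that difference. The cluster shape (one AIC core and two AIV cores) and all type and function names are assumptions for illustration; the PR's actual `count_idle_aiv_cores` signature is not shown in this conversation.

```cpp
#include <array>
#include <cassert>
#include <vector>

namespace sketch {

// ASSUMPTION: each cluster has one AIC core and two AIV cores.
struct ClusterState {
    bool aic_busy = false;
    std::array<bool, 2> aiv_busy{{false, false}};
};

// Clusters in which every core is idle (what a MIX block would need).
inline int count_idle_clusters(const std::vector<ClusterState>& clusters) {
    int n = 0;
    for (const ClusterState& c : clusters) {
        if (!c.aic_busy && !c.aiv_busy[0] && !c.aiv_busy[1]) ++n;
    }
    return n;
}

// Idle AIV cores across all clusters, regardless of AIC state.
inline int count_idle_aiv(const std::vector<ClusterState>& clusters) {
    int n = 0;
    for (const ClusterState& c : clusters) {
        for (bool busy : c.aiv_busy) {
            if (!busy) ++n;
        }
    }
    return n;
}

// An AIV-only sync_start launch fits if there is one idle AIV core per block.
inline bool can_dispatch_aiv_blocks(const std::vector<ClusterState>& clusters,
                                    int block_num) {
    assert(block_num >= 0);
    return count_idle_aiv(clusters) >= block_num;
}

}  // namespace sketch
```

In a state where every cluster has at least one busy core, the cluster count is zero even though several AIV cores may be free, which is the case the per-core count is meant to handle.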

Test examples: spmd_sync_start, spmd_sync_start_aiv, spmd_sync_start_edge, spmd_sync_start_stress, spmd_starvation


gemini-code-assist bot left a comment

Code Review

This pull request implements the sync_start mechanism for SPMD tasks, ensuring that all blocks of a task launch atomically. It introduces a drain protocol in the AicpuExecutor to coordinate resource allocation across scheduler threads and prevent starvation. The submission logic was updated to handle the require_sync_start flag, and a suite of golden tests was added to verify the feature's correctness and robustness. Feedback identifies potential undefined behavior in the bitwise shifts used to build thread masks, and suggests relaxing the deadlock check so that tasks requiring exactly the total available resources are allowed.
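
The two review points can be illustrated generically. In C++, shifting by an amount greater than or equal to the width of the promoted left operand is undefined, so `1 << tid` on a 32-bit `int` is unsafe once `tid` reaches 32. The sketch below shows one common remedy (widen the operand and bounds-check the index) together with a non-strict capacity check. The 64-bit mask width and all names are assumptions; the review's exact suggestions are not reproduced in this conversation.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// ASSUMPTION: thread masks are 64 bits wide.
constexpr int kMaxSchedThreads = 64;

// Widen before shifting and reject out-of-range indices.
inline uint64_t thread_bit(int tid) {
    assert(tid >= 0 && tid < kMaxSchedThreads);
    return uint64_t{1} << tid;
}

// Mask with the low n bits set; n == 64 is handled without shifting by 64.
inline uint64_t all_threads_mask(int n) {
    assert(n >= 0 && n <= kMaxSchedThreads);
    if (n == kMaxSchedThreads) return ~uint64_t{0};
    return (uint64_t{1} << n) - 1;
}

// Non-strict variant of the submit-time check: a task that needs exactly
// all resources can still be satisfied once everything is idle.
inline bool sync_start_submittable_inclusive(int block_num, int total_resources) {
    return block_num > 0 && block_num <= total_resources;
}

}  // namespace sketch
```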

poursoul force-pushed the spmd-dev branch 4 times, most recently from a2a3a68 to 572bfd7, on April 3, 2026 09:53
ChaoWao previously approved these changes Apr 3, 2026

ChaoWao (Collaborator) commented Apr 3, 2026

#441

Port the complete require_sync_start / drain mode implementation from
a2a3 to a5 tensormap_and_ringbuffer runtime:

- pto_submit_types.h: add PTO2_SUBTASK_FLAG_SYNC_START, pto2_core_mask,
  pto2_requires_sync_start; fix pto2_active_mask_to_shape to strip flag
  bits; extend PTO2LaunchSpec with require_sync_start
- pto_orchestrator: add total_cluster_count/total_aiv_count for deadlock
  detection; encode sync_start flag in active_mask at submit time; fix
  total_required_subtasks popcount to use pto2_core_mask
- aicpu_executor: add SyncStartDrainState, active_sched_threads,
  count_idle_aiv_cores, three-phase drain protocol (ack barrier, global
  resource check, exclusive dispatch); modify scheduler main loop with
  drain check and sync_start fast/slow path branching
- Add 5 test examples: spmd_sync_start, spmd_sync_start_aiv,
  spmd_sync_start_edge, spmd_sync_start_stress, spmd_starvation

ChaoWao merged commit d5990a8 into hw-native-sys:main on Apr 3, 2026
10 checks passed