Multi-phased Hierarchical Barrier #1229

Draft
bcmIntc wants to merge 1 commit into Sandia-OpenSHMEM:main from bcmIntc:bcm_cp_smartAtomics

Conversation

Collaborator

@bcmIntc bcmIntc commented Apr 23, 2026

Summary

Adds a three-phase hierarchical barrier (--enable-hierarchical-barrier) that keeps intranode traffic off the NIC: phases 1 and 3 use CPU atomics over XPMEM; phase 2 restricts NIC puts to node roots only.

  • Phase 1 (gather): PEs signal up a k-ary tree. Each PE writes to its own cache-line-padded slot (HIER_SLOT_STRIDE=8 longs, 64 bytes), so no two PEs share a line. This
    eliminates the MESI serialization that a single shared counter would incur.
  • Phase 2 (internode): node roots run a put-based binary dissemination. Each round's slot is reset via a CPU store rather than a self-put, saving ceil(log2(N_nodes)) NIC
    round-trips per barrier.
  • Phase 3 (fanout): the node root stores an ack to each child's down-slot; children relay it down the tree with reset-before-signal ordering.

AUTO selection activates when local PE count ≥ SHMEM_HIER_BARRIER_THRESHOLD (default 2). Also selectable via SHMEM_BARRIER_ALGORITHM=hierarchical.

Test plan

@bcmIntc bcmIntc self-assigned this Apr 23, 2026
@bcmIntc bcmIntc force-pushed the bcm_cp_smartAtomics branch 4 times, most recently from 80162ed to 82cca95 Compare April 29, 2026 16:03
@bcmIntc bcmIntc changed the title from "Dual-Plane Atomics and Hierarchical Barrier" to "Atomic-based Hierarchical Barrier" Apr 29, 2026
@bcmIntc bcmIntc force-pushed the bcm_cp_smartAtomics branch 4 times, most recently from 5153417 to af29269 Compare May 1, 2026 01:31
@bcmIntc bcmIntc changed the title from "Atomic-based Hierarchical Barrier" to "Multi-phased Hierarchical Barrier" May 1, 2026
@bcmIntc bcmIntc force-pushed the bcm_cp_smartAtomics branch 3 times, most recently from 02d3ea1 to a69dc25 Compare May 4, 2026 12:32
Adds --enable-hierarchical-barrier, a three-phase barrier that keeps
intranode traffic off the NIC by using CPU atomics over XPMEM for
gather/fanout and restricts NIC puts to the internode phase (node roots
only).

Phase 1 (intranode gather): local PEs signal up a k-ary tree. Each PE
writes to its OWN up-slot in local_pSync; the parent reads each child's
slot individually. Slots are padded to one cache line (HIER_SLOT_STRIDE=8
longs, 64 bytes) so no two PEs share a line, eliminating the MESI
serialization that would occur if all children wrote to a single counter.
Signal values increase monotonically via hier_sense, so up-slots need no
explicit reset between calls (the role sense alternation plays in a
classic barrier).
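
A minimal sketch of the Phase 1 slot layout and signaling described above, not the code from this PR. HIER_SLOT_STRIDE comes from the description; the tree radix (HIER_TREE_RADIX), the helper names, the root-at-local-rank-0 tree shape, and the use of C11 atomics are assumptions made for illustration. The 64-byte figure holds when sizeof(long) is 8.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define HIER_SLOT_STRIDE 8   /* longs per slot: 64 bytes when sizeof(long) == 8 */
#define HIER_TREE_RADIX  2   /* hypothetical k for the k-ary tree */

/* Index of local PE `lrank`'s up-slot in a local_pSync-style array. */
static size_t up_slot_index(int lrank)
{
    assert(lrank >= 0);
    return (size_t)lrank * HIER_SLOT_STRIDE;
}

/* Parent of `lrank` in a k-ary tree rooted at local rank 0 (-1 for the root). */
static int tree_parent(int lrank)
{
    return lrank == 0 ? -1 : (lrank - 1) / HIER_TREE_RADIX;
}

/* Child number `c` (0..k-1) of `lrank`, or -1 if it falls outside the node. */
static int tree_child(int lrank, int c, int n_local)
{
    int child = lrank * HIER_TREE_RADIX + 1 + c;
    return child < n_local ? child : -1;
}

/* A PE announces arrival by storing the current sense into its OWN slot. */
static void signal_up(_Atomic long *slots, int lrank, long sense)
{
    atomic_store_explicit(&slots[up_slot_index(lrank)], sense,
                          memory_order_release);
}

/* The parent polls one child's slot; values only grow, so no reset is needed. */
static int child_arrived(_Atomic long *slots, int child, long sense)
{
    return atomic_load_explicit(&slots[up_slot_index(child)],
                                memory_order_acquire) >= sense;
}
```

Because every slot starts on its own 64-byte boundary (given a 64-byte-aligned array), a child's store invalidates only the line its parent is polling for that child.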

Phase 2 (internode dissemination): node roots run a put-based binary
dissemination across the NIC. After each round the slot is reset via a
CPU store rather than a self-put, saving ceil(log2(N_nodes)) NIC
round-trips per barrier (12 at 4096 nodes).
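
The round count quoted above can be checked with a short sketch. The function names are hypothetical, and the peer formula is the textbook dissemination pattern (in round r, send to the node 2^r ahead), assumed here rather than taken from the PR.

```c
#include <assert.h>

/* Rounds needed by a binary dissemination over n_nodes: ceil(log2(n_nodes)). */
static int dissem_rounds(int n_nodes)
{
    assert(n_nodes >= 1);
    int rounds = 0;
    for (int span = 1; span < n_nodes; span <<= 1)
        rounds++;
    return rounds;
}

/* Node that `me` puts to in `round` (requires round < dissem_rounds(n_nodes)). */
static int dissem_send_peer(int me, int round, int n_nodes)
{
    return (me + (1 << round)) % n_nodes;
}

/* Node whose put `me` waits for in `round` (same precondition on round). */
static int dissem_recv_peer(int me, int round, int n_nodes)
{
    return (me - (1 << round) + n_nodes) % n_nodes;
}
```

With one slot reset per round, replacing each self-put by a CPU store removes dissem_rounds(N_nodes) NIC operations per barrier, which is 12 for 4096 nodes.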

Phase 3 (intranode fanout): node root CPU-stores an ack into each child's
down-slot; children relay down the k-ary tree with reset-before-signal
ordering. Down-slots are in the upper half of local_pSync, laid out with
the same per-PE cache-line padding as up-slots.
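
One possible reading of the Phase 3 relay, sketched under stated assumptions: the down-slot region starts at n_local * HIER_SLOT_STRIDE (the description only says "upper half"), "reset-before-signal" means a PE clears its own down-slot before storing to its children, and the radix, function names, and C11 atomics are illustrative choices rather than the PR's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define HIER_SLOT_STRIDE 8   /* longs per slot, as in the description */
#define HIER_TREE_RADIX  2   /* hypothetical k for the k-ary tree */

/* Down-slots are assumed to follow the n_local up-slots, with the same padding. */
static size_t down_slot_index(int lrank, int n_local)
{
    assert(lrank >= 0 && lrank < n_local);
    return (size_t)(n_local + lrank) * HIER_SLOT_STRIDE;
}

/* Wait for the ack (non-root only), reset the own slot, then signal children. */
static void fanout_relay(_Atomic long *psync, int lrank, int n_local, long ack)
{
    if (lrank != 0) {
        _Atomic long *mine = &psync[down_slot_index(lrank, n_local)];
        while (atomic_load_explicit(mine, memory_order_acquire) != ack)
            ;  /* spin until the parent's store becomes visible */
        /* Reset first, so the slot is already clean when children are released. */
        atomic_store_explicit(mine, 0, memory_order_relaxed);
    }
    for (int c = 0; c < HIER_TREE_RADIX; c++) {
        int child = lrank * HIER_TREE_RADIX + 1 + c;
        if (child >= n_local)
            break;
        atomic_store_explicit(&psync[down_slot_index(child, n_local)], ack,
                              memory_order_release);
    }
}
```

The release store to each child orders the preceding reset before the signal, matching the ordering named in the description.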

AUTO selection activates when local PE count >= SHMEM_HIER_BARRIER_THRESHOLD
(default 2). Also selectable via SHMEM_BARRIER_ALGORITHM=hierarchical.
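
The selection rule might look roughly like the sketch below. The two environment variable names and the default of 2 come from the description; the enum, the function names, treating an unset variable or the value "auto" as AUTO, and the fallback for other values are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum barrier_alg { BARRIER_OTHER, BARRIER_HIERARCHICAL };

/* Pure decision function: `alg` may be NULL when the variable is unset. */
static enum barrier_alg select_barrier_from(const char *alg, long threshold,
                                            int n_local_pes)
{
    if (alg != NULL && strcmp(alg, "hierarchical") == 0)
        return BARRIER_HIERARCHICAL;            /* explicit request */
    if (alg == NULL || strcmp(alg, "auto") == 0)
        return n_local_pes >= threshold ? BARRIER_HIERARCHICAL : BARRIER_OTHER;
    return BARRIER_OTHER;                       /* some other named algorithm */
}

/* Reads SHMEM_HIER_BARRIER_THRESHOLD, falling back to the default of 2. */
static long hier_threshold(void)
{
    const char *s = getenv("SHMEM_HIER_BARRIER_THRESHOLD");
    if (s == NULL || *s == '\0')
        return 2;
    char *end;
    long v = strtol(s, &end, 10);
    return (*end == '\0' && v > 0) ? v : 2;
}

/* Environment-driven wrapper around the pure decision function. */
static enum barrier_alg select_barrier(int n_local_pes)
{
    return select_barrier_from(getenv("SHMEM_BARRIER_ALGORITHM"),
                               hier_threshold(), n_local_pes);
}
```

With the default threshold of 2, AUTO picks the hierarchical barrier on any node that hosts at least two PEs.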

New infrastructure:
- src/shr_transport.h: XPMEM CPU pointer mapping; self-access returns
  the address directly without an XPMEM lookup
- src/runtime_util.c: global hostname exchange so each PE can identify
  its node root
- configure.ac: --enable-hierarchical-barrier requires --with-xpmem and
  a network transport
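
To illustrate how a hostname exchange lets each PE find its node root, here is a small sketch. It assumes the exchanged names are available as an array indexed by PE and that the node root is the lowest-ranked PE on each host; the description states neither, and both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Node root of `me`: the lowest-ranked PE reporting the same hostname. */
static int node_root_of(const char *const hostnames[], int me)
{
    for (int pe = 0; pe < me; pe++)
        if (strcmp(hostnames[pe], hostnames[me]) == 0)
            return pe;
    return me;  /* no lower-ranked PE shares the host, so `me` is the root */
}

/* Number of PEs that share a host with `me`, including `me` itself. */
static int local_pe_count(const char *const hostnames[], int n_pes, int me)
{
    int count = 0;
    for (int pe = 0; pe < n_pes; pe++)
        if (strcmp(hostnames[pe], hostnames[me]) == 0)
            count++;
    return count;
}
```

The local PE count computed this way is the kind of value the AUTO threshold comparison would consume.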
@bcmIntc bcmIntc force-pushed the bcm_cp_smartAtomics branch from a69dc25 to dbd6c8e Compare May 4, 2026 14:18
@markbrown314 markbrown314 added this to the v1.6.0-perf-r2 milestone May 4, 2026
@markbrown314 markbrown314 self-requested a review May 4, 2026 14:36

2 participants