Skip to content

N:m QP sharing: replace comm grouping with pool-based architecture#20

Open
bharatb226 wants to merge 8 commits intoROCm:rel-pvt-M-1from
bharatb226:m1
Open

N:m QP sharing: replace comm grouping with pool-based architecture#20
bharatb226 wants to merge 8 commits intoROCm:rel-pvt-M-1from
bharatb226:m1

Conversation

@bharatb226
Copy link
Copy Markdown

@bharatb226 bharatb226 commented May 8, 2026

Summary

Replaces the comm grouping QP sharing implementation with a pool-based architecture that supports multi-NIC topologies and per-QP refcounted lifecycle.

What changed

Data structures — Old anpCommGroup/anpCommGroupKey (single QP+CQ per group, PID-scoped keys) replaced with anpSharedQp/anpSharedQpKey (per-QP pool entries keyed by local NIC, peer address, remote NIC index, direction, group index, and QP slot index). Separate refcounts for QPs (refcount) and CQs (cqRefcount).

Connect/accept handshake — Primary/secondary role determined by pool lookup on (peerAddr, groupIdx, qpIdx=0). Primary creates depth-scaled CQs and all QPs for the group. Secondaries map their QP count to the primary's via modulo, reuse primary's QPs and CQs (destroying their own), and skip RTR/RTS transitions.

Completion routingwr_id upper 16 bits encode the sender's commId. imm_data encodes (req_idx << 16) | receiverCommId, allowing direct recv completion routing to the correct communicator and request slot without FIFO-order dependency. Send completions route via wr_id; recv completions route via imm_data.

Teardown — Full-key pool lookup (not QPN) for deregistration. CQ destruction deferred via cqRefcount and deduplicated across pool entries. anpFreeCommId() releases comm table slots.

CTS receiver offload — Enabled (CTS_RCVR_OFFLOAD_ENABLED). Sender bypasses FIFO-based CTS handshake; NIC hardware handles CTS routing.

Parameter defaultsAnpQpDepthMultiplier default changed from 1 to 4. Removed unused device fields (maxQpWr, maxCqe) and parameters (IbDataDirect, IbQpsPerConn, IbQpsPerP2p, IbAbortOnError).

Removed groupRecvDone — Old recv completion used a pending-request FIFO with groupRecvDone flag. New design routes directly via imm_data req_idx to the target request slot.

Removed anpPrintSharingSummary — Debug-only function that scanned the shared QP pool under mutex. Can be re-added when needed.

Test plan

  • No-sharing AllReduce (QPS=4): ~140 GB/s
  • Sharing AllReduce (Groups=4, Depth=2, QPS=1): ~140 GB/s
  • Full 14-config × 10-collective sweep: no regressions; alltoall/alltoallv show ~13-15% improvement with sharing

🤖 Generated with Claude Code

bharatb226 and others added 4 commits May 7, 2026 19:00
Rewrite N:m QP sharing to use a pool-based architecture with composite
keys, refcounted QP/CQ lifecycle, and commId-based completion routing
via wr_id and imm_data encoding. Remove anpPrintSharingSummary (debug-only,
can be re-added when needed). Use rcclParamAnpCommNGroups() > 0 as the
consistent sharing gate. Enable CTS receiver offload.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@bharatb226 bharatb226 changed the title N:m QP sharing redesign with CTS RX offload N:m QP sharing: replace comm grouping with pool-based architecture May 8, 2026
These were dropped by the revert commits but are still used in the
completion polling path.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Comment thread src/net_ib.cc Outdated
RCCL_PARAM(IbAbortOnError, "IB_ABORT_ON_ERROR", 0);
RCCL_PARAM(AnpCommNGroups, "ANP_COMM_NGROUPS", 0);
RCCL_PARAM(AnpQpDepthMultiplier, "ANP_QP_DEPTH_MULTIPLIER", 1);
RCCL_PARAM(AnpQpDepthMultiplier, "ANP_QP_DEPTH_MULTIPLIER", 4);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default depth multiplier to be 1 right ?

bharatb226 and others added 3 commits May 8, 2026 02:20
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
When QP sharing is enabled (groupIdx >= 0), assign UDMA masks based
on group index modulo 2 instead of channel-based tracking. This gives
deterministic, balanced UDMA distribution across sharing groups.

Non-sharing path (groupIdx < 0) retains existing channel-based logic.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants