N:m QP sharing: replace comm grouping with pool-based architecture by bharatb226 · Pull Request #20 · ROCm/amd-anp

bharatb226 · 2026-05-08T00:57:17Z

Summary

Replaces the comm grouping QP sharing implementation with a pool-based architecture that supports multi-NIC topologies and per-QP refcounted lifecycle.

What changed

Data structures — Old anpCommGroup/anpCommGroupKey (single QP+CQ per group, PID-scoped keys) replaced with anpSharedQp/anpSharedQpKey (per-QP pool entries keyed by local NIC, peer address, remote NIC index, direction, group index, and QP slot index). Separate refcounts for QPs (refcount) and CQs (cqRefcount).
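The composite key and per-QP refcounting described above can be pictured with a minimal sketch. The struct layouts, field names, and the `acquireSharedQp` helper below are assumptions made for illustration, not the actual `anpSharedQp`/`anpSharedQpKey` definitions; in particular, when each refcount is incremented is a guess.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for anpSharedQpKey; field names and types are assumed.
struct SharedQpKey {
  int localNic;          // local NIC index
  std::string peerAddr;  // peer address (a string here for simplicity)
  int remoteNicIdx;      // remote NIC index
  bool isSend;           // direction
  int groupIdx;          // sharing group index
  int qpIdx;             // QP slot index within the group

  bool operator<(const SharedQpKey& o) const {
    return std::tie(localNic, peerAddr, remoteNicIdx, isSend, groupIdx, qpIdx) <
           std::tie(o.localNic, o.peerAddr, o.remoteNicIdx, o.isSend, o.groupIdx, o.qpIdx);
  }
};

// Hypothetical stand-in for anpSharedQp: one pool entry per QP.
struct SharedQp {
  std::uint32_t qpn = 0;  // placeholder for the real QP handle
  int refcount = 0;       // communicators using this QP
  int cqRefcount = 0;     // communicators using the associated CQ
};

using SharedQpPool = std::map<SharedQpKey, SharedQp>;

// Returns true if the caller created the entry, false if it reused one.
bool acquireSharedQp(SharedQpPool& pool, const SharedQpKey& key, std::uint32_t newQpn) {
  assert(key.qpIdx >= 0);
  auto res = pool.emplace(key, SharedQp{});
  SharedQp& entry = res.first->second;
  if (res.second) entry.qpn = newQpn;  // only the creator sets the handle
  ++entry.refcount;
  ++entry.cqRefcount;
  return res.second;
}
```

Two keys that differ only in `qpIdx` produce two separate pool entries, which is what makes the lifecycle per-QP rather than per-group.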

Connect/accept handshake — Primary/secondary role determined by pool lookup on (peerAddr, groupIdx, qpIdx=0). Primary creates depth-scaled CQs and all QPs for the group. Secondaries map their QP count to the primary's via modulo, reuse primary's QPs and CQs (destroying their own), and skip RTR/RTS transitions.
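As a rough illustration of the modulo mapping, the sketch below maps each QP slot of a secondary onto one of the primary's QP slots. The function name `mapSecondaryQps` is hypothetical, and the real handshake may compute the mapping differently.

```cpp
#include <cassert>
#include <vector>

// For each of the secondary's QP slots, pick the primary QP slot it reuses.
std::vector<int> mapSecondaryQps(int secondaryQpCount, int primaryQpCount) {
  assert(primaryQpCount > 0 && secondaryQpCount >= 0);
  std::vector<int> mapping(secondaryQpCount);
  for (int i = 0; i < secondaryQpCount; ++i) {
    mapping[i] = i % primaryQpCount;  // wrap around the primary's QPs
  }
  return mapping;
}
```

With this scheme a secondary that asks for more QPs than the primary created wraps around the primary's slots, and one that asks for fewer uses only the first few.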

Completion routing — wr_id upper 16 bits encode the sender's commId. imm_data encodes (req_idx << 16) | receiverCommId, allowing direct recv completion routing to the correct communicator and request slot without FIFO-order dependency. Send completions route via wr_id; recv completions route via imm_data.
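The bit layout can be sketched as follows. The helper names are hypothetical; the sketch assumes "upper 16 bits" means bits 48 to 63 of the 64-bit `wr_id`, and it omits the network byte-order conversion that verbs applies to `imm_data`.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWrIdCommShift = 48;  // assumed: commId occupies bits 48..63
constexpr std::uint64_t kWrIdPayloadMask = (std::uint64_t{1} << kWrIdCommShift) - 1;

// Pack the sender's commId into the upper 16 bits of wr_id.
inline std::uint64_t encodeWrId(std::uint16_t commId, std::uint64_t payload) {
  assert((payload & ~kWrIdPayloadMask) == 0);  // payload must fit in 48 bits
  return (std::uint64_t{commId} << kWrIdCommShift) | payload;
}
inline std::uint16_t wrIdCommId(std::uint64_t wrId) {
  return static_cast<std::uint16_t>(wrId >> kWrIdCommShift);
}
inline std::uint64_t wrIdPayload(std::uint64_t wrId) { return wrId & kWrIdPayloadMask; }

// imm_data = (req_idx << 16) | receiverCommId, in host byte order.
inline std::uint32_t encodeImm(std::uint16_t reqIdx, std::uint16_t receiverCommId) {
  return (std::uint32_t{reqIdx} << 16) | receiverCommId;
}
inline std::uint16_t immReqIdx(std::uint32_t imm) {
  return static_cast<std::uint16_t>(imm >> 16);
}
inline std::uint16_t immCommId(std::uint32_t imm) {
  return static_cast<std::uint16_t>(imm & 0xFFFF);
}
```

For example, `encodeImm(7, 3)` yields `0x00070003`, from which the receiver recovers request slot 7 on communicator 3.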

Teardown — Full-key pool lookup (not QPN) for deregistration. CQ destruction deferred via cqRefcount and deduplicated across pool entries. anpFreeCommId() releases comm table slots.
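One way to picture deferred, deduplicated CQ destruction is the sketch below. It is an assumption-laden illustration, not the PR's code: an `int` stands in for the full composite key, `cqId` stands in for the CQ pointer, and `releaseSharedQp` is a hypothetical name.

```cpp
#include <cassert>
#include <map>
#include <set>

struct PoolEntry {
  int cqId;        // hypothetical CQ identifier; several entries may share one
  int refcount;    // remaining QP users
  int cqRefcount;  // remaining CQ users
};

struct ReleaseResult {
  bool qpDestroyed;
  bool cqDestroyed;
};

using Pool = std::map<int, PoolEntry>;  // int stands in for the full key

// Drop one reference on `key`; destroy the QP on last use and the CQ only
// once no remaining pool entry refers to it.
ReleaseResult releaseSharedQp(Pool& pool, int key, std::set<int>& destroyedCqs) {
  auto it = pool.find(key);
  assert(it != pool.end() && it->second.refcount > 0);
  PoolEntry& e = it->second;
  ReleaseResult result{false, false};

  --e.cqRefcount;
  if (--e.refcount > 0) return result;  // QP still in use by another comm

  const int cqId = e.cqId;
  bool cqIdle = (e.cqRefcount == 0);
  pool.erase(it);
  result.qpDestroyed = true;

  // Defer CQ destruction while any other entry still points at the same CQ.
  for (const auto& kv : pool) {
    if (kv.second.cqId == cqId) cqIdle = false;
  }
  // The set guards against destroying the same CQ twice.
  if (cqIdle && destroyedCqs.insert(cqId).second) result.cqDestroyed = true;
  return result;
}
```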

CTS receiver offload — Enabled (CTS_RCVR_OFFLOAD_ENABLED). Sender bypasses FIFO-based CTS handshake; NIC hardware handles CTS routing.

Parameter defaults — AnpQpDepthMultiplier default changed from 1 to 4. Removed unused device fields (maxQpWr, maxCqe) and parameters (IbDataDirect, IbQpsPerConn, IbQpsPerP2p, IbAbortOnError).

Removed groupRecvDone — Old recv completion used a pending-request FIFO with groupRecvDone flag. New design routes directly via imm_data req_idx to the target request slot.
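A minimal sketch of FIFO-free routing, assuming the `imm_data` layout from the completion routing item and a host-byte-order value. The table sizes, types, and the `routeRecvCompletion` name are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kMaxComms = 8;  // hypothetical table sizes
constexpr int kMaxReqs = 8;

struct Request { bool done = false; };
struct Comm { std::array<Request, kMaxReqs> reqs; };
using CommTable = std::array<Comm, kMaxComms>;

// Mark the request slot named by imm_data as complete. Returns false if the
// decoded indices fall outside the tables.
inline bool routeRecvCompletion(CommTable& comms, std::uint32_t immHostOrder) {
  const std::uint16_t reqIdx = static_cast<std::uint16_t>(immHostOrder >> 16);
  const std::uint16_t commId = static_cast<std::uint16_t>(immHostOrder & 0xFFFF);
  if (commId >= kMaxComms || reqIdx >= kMaxReqs) return false;
  assert(!comms[commId].reqs[reqIdx].done);  // a slot should complete once
  comms[commId].reqs[reqIdx].done = true;
  return true;
}
```

Because each completion names its own slot, completions for slots 3 and then 0 of the same communicator can arrive in either order without any pending-request queue.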

Removed anpPrintSharingSummary — Debug-only function that scanned the shared QP pool under mutex. Can be re-added when needed.

Test plan

No-sharing AllReduce (QPS=4): ~140 GB/s
Sharing AllReduce (Groups=4, Depth=2, QPS=1): ~140 GB/s
Full 14-config × 10-collective sweep: no regressions; alltoall/alltoallv show ~13-15% improvement with sharing

🤖 Generated with Claude Code


Rewrite N:m QP sharing to use a pool-based architecture with composite keys, refcounted QP/CQ lifecycle, and commId-based completion routing via wr_id and imm_data encoding. Remove anpPrintSharingSummary (debug-only, can be re-added when needed). Use rcclParamAnpCommNGroups() > 0 as the consistent sharing gate. Enable CTS receiver offload.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>


karthikarum · 2026-05-08T02:15:37Z

 RCCL_PARAM(IbAbortOnError, "IB_ABORT_ON_ERROR", 0);
 RCCL_PARAM(AnpCommNGroups, "ANP_COMM_NGROUPS", 0);
-RCCL_PARAM(AnpQpDepthMultiplier, "ANP_QP_DEPTH_MULTIPLIER", 1);
+RCCL_PARAM(AnpQpDepthMultiplier, "ANP_QP_DEPTH_MULTIPLIER", 4);


The default depth multiplier should be 1, right?


When QP sharing is enabled (groupIdx >= 0), assign UDMA masks based on group index modulo 2 instead of channel-based tracking. This gives deterministic, balanced UDMA distribution across sharing groups. Non-sharing path (groupIdx < 0) retains existing channel-based logic.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
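The sharing-path rule in this commit message could look roughly like the sketch below. The mask constants and the `udmaMaskForGroup` name are hypothetical placeholders; the real mask values are device-specific, and the non-sharing path is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mask values, chosen only to make the two cases distinct.
constexpr std::uint32_t kUdmaMask0 = 0x1;
constexpr std::uint32_t kUdmaMask1 = 0x2;

// Sharing path only: even groups get one UDMA mask, odd groups the other.
inline std::uint32_t udmaMaskForGroup(int groupIdx) {
  assert(groupIdx >= 0);  // groupIdx < 0 means sharing is disabled
  return (groupIdx % 2 == 0) ? kUdmaMask0 : kUdmaMask1;
}
```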

bharatb226 and others added 4 commits May 7, 2026 19:00

Revert "release devbase after all non-owner comm references are removed (#106) (ROCm#18) (ROCm#19)"

1e55715

This reverts commit cbc951d.

Revert "Support for comm groups (shared comms) (ROCm#17)"

b647ad7

This reverts commit 4b71c4f.

Revert Makefile NDEBUG change to match upstream

e85b0fa

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

bharatb226 changed the title ~~N:m QP sharing redesign with CTS RX offload~~ N:m QP sharing: replace comm grouping with pool-based architecture May 8, 2026

Restore IbAbortOnError param and anp_ibv_poll_cq function

2a93426

These were dropped by the revert commits but are still used in the completion polling path. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

karthikarum reviewed May 8, 2026


bharatb226 and others added 3 commits May 8, 2026 02:20

Change AnpQpDepthMultiplier default from 4 to 1

541aadc

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Enable CTS receiver offload

0f87eab

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

bharatb226 wants to merge 8 commits into ROCm:rel-pvt-M-1 from bharatb226:m1
