Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage #15
karthikarum wants to merge 2 commits into ROCm:main
Conversation
Introduce a shared QP pool that allows multiple NCCL channels targeting
the same peer to reuse a single IB QP, controlled by NCCL_ANP_QP_SHARING.
Channels are assigned to m groups (NCCL_ANP_QP_SHARING_GROUPS) via a
hash of {device, peer address, direction, group index}. The first
channel in each group creates and owns the QP; subsequent channels
increment refcount and share it. Completion routing uses commId-encoded
wr_id (upper 16 bits) and imm_data to dispatch CQEs to the correct
comm. Teardown is reference-counted with separate QP and CQ lifetimes.
Key architectural components:
1. Shared QP Pool (anpSharedQp / g_sharedQpPool)
- Global pool of up to 512 shared QP entries, keyed by
{ibDevN, remoteGid, connectAddr, isSend, groupIdx}
- FNV-1a hash of the key produces a shareGroupId exchanged during
connection metadata
- Reference-counted: first channel becomes owner and creates the QP;
subsequent channels increment refcount and reuse it
- Separate refcount (QP lifetime) and cqRefcount (CQ/device-base lifetime)
2. Comm Routing Table (g_commTable)
- Each comm is assigned a 16-bit commId stored in a global table
- wr_id encoded as [commId:16 | payload:48] to identify which comm
owns a given completion
- On recv side, imm_data carries the sender's remote commId to route
RDMA_WRITE_WITH_IMM completions to the correct comm
3. Pending Recv Queue
- Shared QPs can't use per-device event counting for recv completion
tracking (RECV WQE may be consumed by any sharer)
- Each comm maintains a FIFO of pending recv requests
(pendingRecvReqs[]), completions matched via imm_data routing +
FIFO ordering
- sharedRecvDone flag on requests replaces event-count-based
completion detection
4. Connection Setup (Connect/Accept)
- Sender side: looks up existing shared QP by key; if found, reuses
it (skips RTR/RTS); if not, creates new one and registers as owner
- Receiver side: matches on shareGroupId from sender metadata; same
owner/sharer logic
- shareGroupId, commId, and sharedGroupIdx exchanged in
ncclIbConnectionMetadata
5. Teardown (CloseSend/CloseRecv)
- Shared QP: decrement refcount; destroy QP only when last sharer
closes; destroy CQ/device-base only when cqRefcount reaches 0
- Non-shared: unchanged original path
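The key hashing in component 1 can be sketched as follows. The struct layout and helper names here are illustrative stand-ins, not the plugin's actual types; only the FNV-1a algorithm itself and the idea of hashing {device, peer address, direction, group index} into a shareGroupId come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the pool key; the real key also carries the
 * peer GID and a full socket address. */
struct qpShareKey {
  int32_t ibDevN;     /* local IB device index */
  uint64_t peerAddr;  /* placeholder for the peer's address bytes */
  uint8_t isSend;     /* direction */
  uint8_t groupIdx;   /* which of the m sharing groups */
};

/* 64-bit FNV-1a over a byte range, continuing from a running hash h. */
static uint64_t fnv1a64(uint64_t h, const void* data, size_t len) {
  const unsigned char* p = (const unsigned char*)data;
  for (size_t i = 0; i < len; i++) {
    h ^= p[i];
    h *= 1099511628211ULL;  /* FNV prime */
  }
  return h;
}

/* Hash the key fields into a shareGroupId for the connection metadata.
 * Feeding fields one by one avoids hashing struct padding bytes. */
static uint64_t shareGroupIdOf(const struct qpShareKey* k) {
  uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
  h = fnv1a64(h, &k->ibDevN, sizeof k->ibDevN);
  h = fnv1a64(h, &k->peerAddr, sizeof k->peerAddr);
  h = fnv1a64(h, &k->isSend, sizeof k->isSend);
  h = fnv1a64(h, &k->groupIdx, sizeof k->groupIdx);
  return h;
}
```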
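The wr_id encoding in component 2, [commId:16 | payload:48], amounts to the following helpers (names are ours, chosen for illustration; the plugin's actual identifiers may differ):

```c
#include <stdint.h>

#define WRID_PAYLOAD_BITS 48
#define WRID_PAYLOAD_MASK ((1ULL << WRID_PAYLOAD_BITS) - 1)

/* Pack a 16-bit commId into the upper bits of the 64-bit wr_id, with the
 * per-request payload confined to the lower 48 bits. */
static uint64_t wrIdEncode(uint16_t commId, uint64_t payload) {
  return ((uint64_t)commId << WRID_PAYLOAD_BITS) | (payload & WRID_PAYLOAD_MASK);
}

/* On a completion, recover which comm owns the CQE... */
static uint16_t wrIdCommId(uint64_t wrId) {
  return (uint16_t)(wrId >> WRID_PAYLOAD_BITS);
}

/* ...and the original request payload. */
static uint64_t wrIdPayload(uint64_t wrId) {
  return wrId & WRID_PAYLOAD_MASK;
}
```

On the recv side the same 16-bit commId travels in imm_data instead, since the receiver never sees the sender's wr_id.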
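The dual-lifetime teardown in component 5 (QP destroyed when the last sharer closes, CQ/device base kept until cqRefcount drains) can be sketched as below; the entry layout and function names are illustrative, not the plugin's:

```c
#include <stdbool.h>

/* Illustrative shared-pool entry: refcount gates the QP lifetime,
 * cqRefcount gates the CQ/device-base lifetime. */
struct sharedQpEntry {
  int refcount;      /* sharers still holding the QP */
  int cqRefcount;    /* sharers still holding the CQ/device base */
  bool qpDestroyed;
  bool cqDestroyed;
};

/* Called from CloseSend/CloseRecv for each sharer of the entry. */
static void sharedQpClose(struct sharedQpEntry* e, bool holdsCq) {
  if (--e->refcount == 0) {
    e->qpDestroyed = true;   /* last sharer: ibv_destroy_qp() goes here */
  }
  if (holdsCq && --e->cqRefcount == 0) {
    e->cqDestroyed = true;   /* last CQ holder: ibv_destroy_cq() goes here */
  }
}
```

Keeping the two counts separate is what lets a CQ outlive an individual QP when other sharers on the same device still poll it.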
Environment variables (sharing is opt-in, disabled by default):
NCCL_ANP_QP_SHARING (default 0): enable QP sharing
NCCL_ANP_QP_SHARING_DEPTH (default 4): WQ depth multiplier for shared QPs
NCCL_ANP_QP_SHARING_DISABLE_CTS (default 1): disable CTS receiver offload on shared QPs
NCCL_ANP_QP_SHARING_GROUPS (default 2): number of sharing groups (m in N:m)
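As a usage sketch, a run with sharing enabled might export the following (values are illustrative; only NCCL_ANP_QP_SHARING=1 is required to turn the feature on):

```shell
# Opt in to QP sharing and spread channels across 4 sharing groups.
export NCCL_ANP_QP_SHARING=1
export NCCL_ANP_QP_SHARING_GROUPS=4
```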
Other changes:
- CTS inline / receiver offload disabled by default (compile-time
defines commented out)
- Device capability caching: maxQpWr and maxCqe stored per device;
shared QP/CQ depths clamped to hardware limits
- UDMA mask steering: shared QPs alternate between high/low UDMA masks
based on groupIdx % 2
- Logging upgrades: several TRACE calls promoted to INFO for QP
lifecycle visibility
- wr_id encoding fix: bitwise OR instead of addition for request index
packing (correctness fix for commId-in-upper-bits encoding)
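The wr_id packing fix in the last bullet matters because arithmetic addition can carry out of the 48-bit payload field and corrupt the commId bits, while bitwise OR cannot. A minimal illustration (helper names and values are contrived):

```c
#include <stdint.h>

/* wr_id layout: [commId:16 | payload:48]. */
static uint64_t makeWrId(uint16_t commId, uint64_t payload) {
  return ((uint64_t)commId << 48) | (payload & 0xFFFFFFFFFFFFULL);
}

/* Buggy: addition can carry past bit 47 into the commId field. */
static uint64_t packReqIdxBuggy(uint64_t wrId, uint64_t reqIdx) {
  return wrId + reqIdx;
}

/* Fixed: OR stays confined to the bits reserved for the request index. */
static uint64_t packReqIdxFixed(uint64_t wrId, uint64_t reqIdx) {
  return wrId | reqIdx;
}
```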
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
NCCL_PARAM(AnpQpSharing, "ANP_QP_SHARING", 0);
NCCL_PARAM(AnpQpSharingDepth, "ANP_QP_SHARING_DEPTH", 4);
NCCL_PARAM(AnpQpSharingDisableCts, "ANP_QP_SHARING_DISABLE_CTS", 1);
NCCL_PARAM(AnpQpSharingGroups, "ANP_QP_SHARING_GROUPS", 2);
Are we getting good perf numbers even with 2?

With ANP_QP_SHARING_GROUPS set to 2, we were getting a peak busbw of 344 Gbps for AllReduce with a 16G message.
Only with ANP_QP_SHARING_GROUPS set to 4 did we hit the peak of 357 Gbps.
We just wanted the default to be an even number, so traffic is distributed across UDs.
//#define ANP_DEBUG_TRACE_EN
#define CTS_INLINE_ENABLED
#define CTS_RCVR_OFFLOAD_ENABLED
//#define CTS_INLINE_ENABLED
Ideally the inline is independent of RCVR offload. We should eventually try enabling CTS inline.

Right, the PR needs a bit of cleanup and a re-test for CTS RCVR offload. Since we originally started these experiments with offload disabled, we just stuck with it. In any case, if sharing is enabled it will disable CTS RCVR offload at runtime, because ANP_QP_SHARING_DISABLE_CTS defaults to 1.
A follow-up question: for QP sharing, do we need to support CTS RCVR offload enabled?

In the current design we can't support CTS RCVR offload with QP sharing.
// CQ is sized to accommodate the max SQ + RQ WQE completions. If each SQ WQE could be signaled, then,
// for each QP, there can be 2*MAX_REQUESTS completions for SQ and MAX_REQUESTS completions for RQ.
NCCLCHECK(wrap_ibv_create_cq(&base->cq, ibDev->context, 3*MAX_REQUESTS*ncclParamIbQpsPerConn(), cq_context, NULL, 0));
int cqDepth = 3 * MAX_REQUESTS * ncclParamIbQpsPerConn();
Follow-up: do we need to create a CQ for non-owner channels?
}
ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
ncclIbDevs[ncclNIbDevs].maxQpWr = devAttr.max_qp_wr;
ncclIbDevs[ncclNIbDevs].maxCqe = devAttr.max_cqe;
Was max_cqe not being used earlier, and is it used only with these changes?
// ============================================================
// QP Sharing Infrastructure
// ============================================================

struct anpSharedQpKey {
  int ibDevN;
  union ibv_gid remoteGid;
  union ncclSocketAddress connectAddr;
  bool isSend;
So, in my understanding, the key should uniquely identify a pair of ranks. The connect_addr and remote_gid identify the remote rank. What identifies the local rank?
struct ibv_cq* ownerCq;
struct ncclIbNetCommDevBase* ownerDevBase;
int ownerIbDevN;
int refcount;
Can we add a comment explaining what this refcount is?
… tracing
Crash fix:
- Add senderListenId (PID) to ncclIbConnectionMetadata and ncclIbHandle
to disambiguate peers with identical shareGroupId hashes
- Replace hash-based anpFindSharedQpByShareGroup() with full-key lookup
using anpFindSharedQp() on accept side — prevents two different senders
from colliding into the same shared QP
- Refactor anpSharedQpKey: remove remoteGid, rename connectAddr→peerAddr,
add peerListenId — key is now {ibDevN, peerAddr(stripped), peerListenId,
isSend, groupIdx}
- Accept-side registration uses the same lookup key used for the find,
eliminating the incomplete key that was missing peerAddr and peerListenId
- anpCountPeerTotalRefcount now scopes by peerListenId
QP sharing decision tracing (5 WARN logs, prefix NET/ANP/QPS:):
- Connect lookup: full key, REUSE/NEW decision, QPN, refcount
- Connect REGISTERED: pool entry details for new QP owners
- Connect SEND meta: metadata sent to receiver for correlation
- Accept RECV meta: metadata received from sender
- Accept lookup: accept-side key, REUSE/NEW decision, QPN, refcount
* Support Multiple QP sharing groups extension and distributed owner sharing/UD pinning
Squashed from PR #15 (2 commits):
- Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage
- Fix accept-side QP sharing hash collision + add WARN-level decision tracing

Cleanup and refactor comm grouping (PR #15 follow-up)

Rename all QP sharing references to the comm grouping mental model:
- Structs: anpSharedQpKey -> anpCommGroupKey, anpSharedQpEntry -> anpCommGroup
- Functions: anpRegisterSharedQp -> anpAddCommGroup, anpFindSharedQp -> anpFindCommGroup
- Fields: sharedGroupIdx -> commGroupIdx, shareGroupId -> senderPid, peerListenId -> peerPid
- Wire protocol: remove groupHash from ncclIbConnectionMetadata (redundant)

Consolidate env variables from 4 to 2:
- Remove NCCL_ANP_COMM_GROUPING (redundant, NGROUPS>0 implies enabled)
- Remove NCCL_ANP_COMM_GROUPING_DISABLE_CTS (hardcode for grouped QPs)
- Rename NCCL_ANP_COMM_GROUPING_NGROUPS -> NCCL_ANP_COMM_NGROUPS (default 0)
- Rename NCCL_ANP_COMM_GROUPING_DEPTH -> NCCL_ANP_QP_DEPTH_MULTIPLIER (default 1)
- Simplify queue depth from ceil(D/m) to direct multiplier

Fix commId table collision on wrap (anpCommDbEntryAdd scans for free slot).
Refactor CloseSend/CloseRecv to deduplicate per-device cleanup loop.
Add TODO.md documenting open issues (use-after-free, wasteful CQ, etc).

(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)
Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
* Plugin update (#14)
Compatible RCCL: https://github.com/ROCm/rccl/tree/develop (commit hash 420b3b840e0324ea897db7f04028471a4ea830d7)
Pen amd-anp sha: 96ba08c5d900e286a5dc8d50cb84da4438adb662
(cherry picked from commit 1243259)

* Add N:m QP sharing for AMD ANP plugin (#105) (#16)
Squashed from PR #15 (2 commits); commit message identical to the entry above.
(cherry picked from commit 5e15f51)

Co-authored-by: Karthikeyan Arumugam <karthik@pensando.io>
Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>