
Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage #15

Draft
karthikarum wants to merge 2 commits into ROCm:main from karthikarum:qp_sharing

Conversation

@karthikarum
Collaborator

Introduce a shared QP pool that allows multiple NCCL channels targeting the same peer to reuse a single IB QP, controlled by NCCL_ANP_QP_SHARING. Channels are assigned to m groups (NCCL_ANP_QP_SHARING_GROUPS) via a hash of {device, peer address, direction, group index}. The first channel in each group creates and owns the QP; subsequent channels increment refcount and share it. Completion routing uses commId-encoded wr_id (upper 16 bits) and imm_data to dispatch CQEs to the correct comm. Teardown is reference-counted with separate QP and CQ lifetimes.

Key architectural components:

  1. Shared QP Pool (anpSharedQp / g_sharedQpPool)

    • Global pool of up to 512 shared QP entries, keyed by {ibDevN, remoteGid, connectAddr, isSend, groupIdx}
    • FNV-1a hash of the key produces a shareGroupId exchanged in the connection metadata
    • Reference-counted: first channel becomes owner and creates the QP; subsequent channels increment refcount and reuse it (see the sketch after this list)
    • Separate refcount (QP lifetime) and cqRefcount (CQ/device-base lifetime)
  2. Comm Routing Table (g_commTable)

    • Each comm is assigned a 16-bit commId stored in a global table
    • wr_id encoded as [commId:16 | payload:48] to identify which comm owns a given completion
    • On recv side, imm_data carries the sender's remote commId to route RDMA_WRITE_WITH_IMM completions to the correct comm
  3. Pending Recv Queue

    • Shared QPs can't use per-device event counting for recv completion tracking (RECV WQE may be consumed by any sharer)
    • Each comm maintains a FIFO of pending recv requests (pendingRecvReqs[]), completions matched via imm_data routing + FIFO ordering
    • sharedRecvDone flag on requests replaces event-count-based completion detection
  4. Connection Setup (Connect/Accept)

    • Sender side: looks up existing shared QP by key; if found, reuses it (skips RTR/RTS); if not, creates new one and registers as owner
    • Receiver side: matches on shareGroupId from sender metadata; same owner/sharer logic
    • shareGroupId, commId, and sharedGroupIdx exchanged in ncclIbConnectionMetadata
  5. Teardown (CloseSend/CloseRecv)

    • Shared QP: decrement refcount; destroy QP only when last sharer closes; destroy CQ/device-base only when cqRefcount reaches 0
    • Non-shared: unchanged original path
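
A minimal sketch of the pool lookup described in items 1 and 2 above. The types here are simplified stand-ins and the helper names (anpFnv1a, anpFindOrAddSharedQp, anpSharedQpKeySketch) are illustrative assumptions, not the plugin's actual code; the real key uses union ibv_gid and union ncclSocketAddress from the plugin headers.

  /* Illustrative sketch only; real key/entry types live in src/net_ib.cc. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  #define ANP_SHARED_QP_POOL_SIZE 512            /* "up to 512 shared QP entries" */

  struct anpSharedQpKeySketch {
    int     ibDevN;                              /* local HCA index */
    uint8_t remoteGid[16];                       /* peer GID, raw bytes */
    uint8_t connectAddr[28];                     /* peer socket address, raw bytes */
    bool    isSend;                              /* direction */
    int     groupIdx;                            /* 0 .. NCCL_ANP_QP_SHARING_GROUPS-1 */
  };

  struct anpSharedQpEntrySketch {
    struct anpSharedQpKeySketch key;
    uint64_t shareGroupId;                       /* FNV-1a of the key, exchanged in metadata */
    void*    qp;                                 /* struct ibv_qp* in the real code */
    int      refcount;                           /* QP lifetime */
    int      cqRefcount;                         /* CQ / device-base lifetime */
    bool     used;
  };

  static struct anpSharedQpEntrySketch g_poolSketch[ANP_SHARED_QP_POOL_SIZE];

  /* FNV-1a over the key bytes; the result is the shareGroupId sent in
   * the connection metadata. */
  static uint64_t anpFnv1a(const void* data, size_t len) {
    const uint8_t* p = (const uint8_t*)data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
    return h;
  }

  /* First channel with a given key becomes the owner (creates the QP and
   * starts refcount at 1); later channels just bump refcount and reuse it.
   * Callers must memset the key to zero before filling it so struct
   * padding compares equal under memcmp. */
  static struct anpSharedQpEntrySketch* anpFindOrAddSharedQp(const struct anpSharedQpKeySketch* key) {
    struct anpSharedQpEntrySketch* freeSlot = NULL;
    for (int i = 0; i < ANP_SHARED_QP_POOL_SIZE; i++) {
      if (g_poolSketch[i].used && memcmp(&g_poolSketch[i].key, key, sizeof(*key)) == 0) {
        g_poolSketch[i].refcount++;              /* sharer path: reuse, skip RTR/RTS */
        return &g_poolSketch[i];
      }
      if (!g_poolSketch[i].used && freeSlot == NULL) freeSlot = &g_poolSketch[i];
    }
    if (freeSlot == NULL) return NULL;           /* pool exhausted */
    memset(freeSlot, 0, sizeof(*freeSlot));
    freeSlot->used = true;
    freeSlot->key = *key;
    freeSlot->shareGroupId = anpFnv1a(key, sizeof(*key));
    freeSlot->refcount = 1;                      /* owner path: caller creates the QP */
    freeSlot->cqRefcount = 1;
    return freeSlot;
  }

The entry carries two counters because QP and CQ/device-base lifetimes are tracked separately, as described in item 5 above.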

Environment variables (all opt-in, disabled by default):
NCCL_ANP_QP_SHARING=0              Enable QP sharing
NCCL_ANP_QP_SHARING_DEPTH=4        WQ depth multiplier for shared QPs
NCCL_ANP_QP_SHARING_DISABLE_CTS=1  Disable CTS receiver offload on shared QPs
NCCL_ANP_QP_SHARING_GROUPS=2       Number of sharing groups (m in N:m)

Other changes:

  • CTS inline / receiver offload disabled by default (compile-time defines commented out)
  • Device capability caching: maxQpWr and maxCqe stored per device; shared QP/CQ depths clamped to hardware limits
  • UDMA mask steering: shared QPs alternate between high/low UDMA masks based on groupIdx % 2
  • Logging upgrades: several TRACE calls promoted to INFO for QP lifecycle visibility
  • wr_id encoding fix: bitwise OR instead of addition for request index packing (correctness fix for commId-in-upper-bits encoding)
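
As a concrete illustration of the wr_id layout and the OR-vs-addition fix mentioned above, here is a hedged sketch; the macro and function names are hypothetical, not the plugin's.

  #include <stdint.h>

  #define ANP_WRID_COMMID_SHIFT  48
  #define ANP_WRID_PAYLOAD_MASK  ((1ULL << ANP_WRID_COMMID_SHIFT) - 1)

  /* Pack [commId:16 | payload:48]. Using | instead of + matters because an
   * addition can carry out of the low 48 bits and silently corrupt the
   * commId field whenever the payload term is not strictly confined to
   * its own width. */
  static inline uint64_t anpWrIdEncode(uint16_t commId, uint64_t payload) {
    return ((uint64_t)commId << ANP_WRID_COMMID_SHIFT) | (payload & ANP_WRID_PAYLOAD_MASK);
  }

  /* On completion, the upper 16 bits route the CQE back to the owning comm. */
  static inline uint16_t anpWrIdCommId(uint64_t wrId)  { return (uint16_t)(wrId >> ANP_WRID_COMMID_SHIFT); }
  static inline uint64_t anpWrIdPayload(uint64_t wrId) { return wrId & ANP_WRID_PAYLOAD_MASK; }

On the receive side the same idea is carried in imm_data, which holds the sender's commId so RDMA_WRITE_WITH_IMM completions can be routed to the correct comm.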

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Comment thread src/net_ib.cc
NCCL_PARAM(AnpQpSharing, "ANP_QP_SHARING", 0);
NCCL_PARAM(AnpQpSharingDepth, "ANP_QP_SHARING_DEPTH", 4);
NCCL_PARAM(AnpQpSharingDisableCts, "ANP_QP_SHARING_DISABLE_CTS", 1);
NCCL_PARAM(AnpQpSharingGroups, "ANP_QP_SHARING_GROUPS", 2);

Are we getting good perf numbers even with 2?

Collaborator Author


With ANP_QP_SHARING_GROUPS set to 2, we were getting a peak busbw of 344 Gbps for AllReduce with a 16G message. Only with ANP_QP_SHARING_GROUPS set to 4 did we hit the peak of 357 Gbps.
I just wanted the default to be an even number so the groups are distributed across the UDMA masks.

Comment thread src/net_ib.cc
//#define ANP_DEBUG_TRACE_EN
#define CTS_INLINE_ENABLED
#define CTS_RCVR_OFFLOAD_ENABLED
//#define CTS_INLINE_ENABLED

Ideally the inline is independent of RCVR offload. We should eventually try enabling cts inline.

Collaborator Author


Right, the PR needs a bit of cleanup and a re-test for CTS RCVR offload. Since we originally started these experiments with offload disabled, we just stuck with it. In any case, if sharing is enabled, CTS RCVR offload is disabled at runtime because the env ANP_QP_SHARING_DISABLE_CTS defaults to 1.

A follow-up question: for QP sharing, do we need to support CTS RCVR offload enabled?


In the current design we can't support CTS RCVR offload with QP sharing.

Comment thread src/net_ib.cc
// CQ is sized to accommodate the max SQ + RQ WQE completions. If each SQ WQE could be signaled, then,
// for each QP, there can be 2*MAX_REQUESTS completions for SQ and MAX_REQUESTS completions for RQ.
NCCLCHECK(wrap_ibv_create_cq(&base->cq, ibDev->context, 3*MAX_REQUESTS*ncclParamIbQpsPerConn(), cq_context, NULL, 0));
int cqDepth = 3 * MAX_REQUESTS * ncclParamIbQpsPerConn();
Collaborator Author


Follow-up: do we need to create a CQ for non-owner channels?

Comment thread src/net_ib.cc
}
ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
ncclIbDevs[ncclNIbDevs].maxQpWr = devAttr.max_qp_wr;
ncclIbDevs[ncclNIbDevs].maxCqe = devAttr.max_cqe;

Was max_cqe not used earlier, and is it used only with these changes?

Comment thread src/net_ib.cc
// ============================================================

struct anpSharedQpKey {
int ibDevN;

what does this identify?

Comment thread src/net_ib.cc
int ibDevN;
union ibv_gid remoteGid;
union ncclSocketAddress connectAddr;
bool isSend;

Why is this part of the key?

Comment thread src/net_ib.cc
// QP Sharing Infrastructure
// ============================================================

struct anpSharedQpKey {

So, in my understanding, the key should uniquely identify a pair of ranks. The connect_addr and remote_gid identify the remote rank. What identifies the local rank?

Comment thread src/net_ib.cc
struct ibv_cq* ownerCq;
struct ncclIbNetCommDevBase* ownerDevBase;
int ownerIbDevN;
int refcount;

Can we add a comment as to what this ref count is?

Fix accept-side QP sharing hash collision + add WARN-level decision tracing

      Crash fix:
      - Add senderListenId (PID) to ncclIbConnectionMetadata and ncclIbHandle
        to disambiguate peers with identical shareGroupId hashes
      - Replace hash-based anpFindSharedQpByShareGroup() with full-key lookup
        using anpFindSharedQp() on accept side — prevents two different senders
        from colliding into the same shared QP
      - Refactor anpSharedQpKey: remove remoteGid, rename connectAddr→peerAddr,
        add peerListenId — key is now {ibDevN, peerAddr(stripped), peerListenId,
        isSend, groupIdx}
      - Accept-side registration uses the same lookup key used for the find,
        eliminating the incomplete key that was missing peerAddr and peerListenId
      - anpCountPeerTotalRefcount now scopes by peerListenId

      QP sharing decision tracing (5 WARN logs, prefix NET/ANP/QPS:):
      - Connect lookup: full key, REUSE/NEW decision, QPN, refcount
      - Connect REGISTERED: pool entry details for new QP owners
      - Connect SEND meta: metadata sent to receiver for correlation
      - Accept RECV meta: metadata received from sender
      - Accept lookup: accept-side key, REUSE/NEW decision, QPN, refcount

* Support Multiple QP sharing groups extension and distributed owner sharing/UD pinning
sarat-k added a commit that referenced this pull request May 7, 2026
Squashed from PR #15 (2 commits):
    - Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage
    - Fix accept-side QP sharing hash collision + add WARN-level decision tracing

    Cleanup and refactor comm grouping (PR#15 follow-up)

    Rename all QP sharing references to comm grouping mental model:
    - Structs: anpSharedQpKey -> anpCommGroupKey, anpSharedQpEntry -> anpCommGroup
    - Functions: anpRegisterSharedQp -> anpAddCommGroup, anpFindSharedQp -> anpFindCommGroup
    - Fields: sharedGroupIdx -> commGroupIdx, shareGroupId -> senderPid, peerListenId -> peerPid
    - Wire protocol: remove groupHash from ncclIbConnectionMetadata (redundant)

    Consolidate env variables from 4 to 2:
    - Remove NCCL_ANP_COMM_GROUPING (redundant, NGROUPS>0 implies enabled)
    - Remove NCCL_ANP_COMM_GROUPING_DISABLE_CTS (hardcode for grouped QPs)
    - Rename NCCL_ANP_COMM_GROUPING_NGROUPS -> NCCL_ANP_COMM_NGROUPS (default 0)
    - Rename NCCL_ANP_COMM_GROUPING_DEPTH -> NCCL_ANP_QP_DEPTH_MULTIPLIER (default 1)
    - Simplify queue depth from ceil(D/m) to direct multiplier

    Fix commId table collision on wrap (anpCommDbEntryAdd scans for free slot).
    Refactor CloseSend/CloseRecv to deduplicate per-device cleanup loop.
    Add TODO.md documenting open issues (use-after-free, wasteful CQ, etc).


(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)

Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
sarat-k added a commit that referenced this pull request May 7, 2026
* Plugin update (#14)

Compatible RCCL:
 https://github.com/ROCm/rccl/tree/develop (commit hash 420b3b840e0324ea897db7f04028471a4ea830d7)

 Pen amd-anp sha: 96ba08c5d900e286a5dc8d50cb84da4438adb662

(cherry picked from commit 1243259)

* Add N:m QP sharing for AMD ANP plugin (#105) (#16)

(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)

Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
(cherry picked from commit 5e15f51)

---------

Co-authored-by: Karthikeyan Arumugam <karthik@pensando.io>
Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>