Add N:m QP sharing for AMD ANP plugin (#105) #16

Merged
sarat-k merged 1 commit into ROCm:main from sarat-k:shared-comm
May 7, 2026
sarat-k commented May 7, 2026

- Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage
- Fix accept-side QP sharing hash collision + add WARN-level decision tracing
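
To make the N:m idea concrete, here is a minimal sketch of how N communicators might be spread over m shared QPs, and how many QPs that leaves per peer. The round-robin rule and both function names (`example_comm_to_group`, `example_qps_required`) are assumptions for illustration; the PR description does not state how the plugin assigns communicators to groups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical round-robin rule: communicator i uses shared group i % m. */
static size_t example_comm_to_group(size_t comm_index, size_t ngroups) {
    assert(ngroups > 0);
    return comm_index % ngroups;
}

/* QPs needed towards one peer: N without sharing, at most m with sharing.
 * ngroups == 0 is treated as "sharing disabled" in this sketch. */
static size_t example_qps_required(size_t ncomms, size_t ngroups) {
    if (ngroups == 0 || ngroups >= ncomms) {
        return ncomms;
    }
    return ngroups;
}
```

Under these assumptions, 16 communicators with 4 groups would need 4 QPs per peer instead of 16, which is the HCA resource saving the PR is after.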

Cleanup and refactor comm grouping (PR#15 follow-up)

Rename all QP sharing references to comm grouping mental model:
- Structs: anpSharedQpKey -> anpCommGroupKey, anpSharedQpEntry -> anpCommGroup
- Functions: anpRegisterSharedQp -> anpAddCommGroup, anpFindSharedQp -> anpFindCommGroup
- Fields: sharedGroupIdx -> commGroupIdx, shareGroupId -> senderPid, peerListenId -> peerPid
- Wire protocol: remove groupHash from ncclIbConnectionMetadata (redundant)
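
The find/add pair named above can be pictured as a small lookup table. The following is a sketch only: the field names `senderPid`, `peerPid`, and `commGroupIdx` come from the rename list, but the struct layout, the `example`-prefixed names, the table capacity, and the reference count are all hypothetical and not taken from the plugin source. Matching on every key field rather than on a hash value is one way to avoid the kind of collision mentioned earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_GROUPS 64 /* illustrative capacity, not from the PR */

/* Field names follow the PR's renames; the layout itself is assumed. */
struct exampleCommGroupKey {
    uint64_t senderPid;
    uint64_t peerPid;
    int commGroupIdx;
};

struct exampleCommGroup {
    struct exampleCommGroupKey key;
    bool inUse;
    int refCount; /* number of comms currently sharing this group */
};

static struct exampleCommGroup exampleGroups[EXAMPLE_MAX_GROUPS];

static bool exampleKeyEqual(const struct exampleCommGroupKey *a,
                            const struct exampleCommGroupKey *b) {
    return a->senderPid == b->senderPid && a->peerPid == b->peerPid &&
           a->commGroupIdx == b->commGroupIdx;
}

/* Return the group whose key matches on every field, or NULL. */
static struct exampleCommGroup *exampleFindCommGroup(
        const struct exampleCommGroupKey *key) {
    assert(key != NULL);
    for (size_t i = 0; i < EXAMPLE_MAX_GROUPS; i++) {
        if (exampleGroups[i].inUse &&
            exampleKeyEqual(&exampleGroups[i].key, key)) {
            return &exampleGroups[i];
        }
    }
    return NULL;
}

/* Join an existing group or claim a free entry; NULL if the table is full. */
static struct exampleCommGroup *exampleAddCommGroup(
        const struct exampleCommGroupKey *key) {
    struct exampleCommGroup *g = exampleFindCommGroup(key);
    if (g != NULL) {
        g->refCount++;
        return g;
    }
    for (size_t i = 0; i < EXAMPLE_MAX_GROUPS; i++) {
        if (!exampleGroups[i].inUse) {
            exampleGroups[i].key = *key;
            exampleGroups[i].inUse = true;
            exampleGroups[i].refCount = 1;
            return &exampleGroups[i];
        }
    }
    return NULL;
}
```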

Consolidate env variables from 4 to 2:
- Remove NCCL_ANP_COMM_GROUPING (redundant, NGROUPS>0 implies enabled)
- Remove NCCL_ANP_COMM_GROUPING_DISABLE_CTS (hardcode for grouped QPs)
- Rename NCCL_ANP_COMM_GROUPING_NGROUPS -> NCCL_ANP_COMM_NGROUPS (default 0)
- Rename NCCL_ANP_COMM_GROUPING_DEPTH -> NCCL_ANP_QP_DEPTH_MULTIPLIER (default 1)
- Simplify queue depth from ceil(D/m) to direct multiplier
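
As a rough illustration of the consolidated settings, the sketch below reads an integer from the environment and contrasts the old `ceil(D/m)` depth with a direct multiplier. Only the two variable names and their defaults come from the list above. The helper names, the use of `getenv`/`strtol`, and the assumption that the multiplier scales some base depth are mine; the plugin may parse and apply these values differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Read an integer environment variable, falling back on unset or bad input. */
static long example_env_long(const char *name, long fallback) {
    const char *s = getenv(name);
    if (s == NULL || *s == '\0') {
        return fallback;
    }
    char *end = NULL;
    long v = strtol(s, &end, 10);
    return (*end == '\0') ? v : fallback;
}

/* Grouping is implied by NGROUPS > 0, so no separate enable flag is needed. */
static int example_grouping_enabled(void) {
    return example_env_long("NCCL_ANP_COMM_NGROUPS", 0) > 0;
}

/* Old scheme as described in the PR: ceil(D / m), in integer arithmetic. */
static long example_depth_ceil(long d, long m) {
    assert(m > 0);
    return (d + m - 1) / m;
}

/* New scheme: a direct multiplier (default 1) applied to an assumed base. */
static long example_depth_scaled(long base_depth) {
    long mult = example_env_long("NCCL_ANP_QP_DEPTH_MULTIPLIER", 1);
    return base_depth * (mult > 0 ? mult : 1);
}
```

With both variables unset, this sketch reports grouping as disabled and leaves the depth unchanged, which matches the stated defaults of 0 and 1.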

Fix commId table collision on wrap (anpCommDbEntryAdd scans for free slot).
Refactor CloseSend/CloseRecv to deduplicate per-device cleanup loop.
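
The commId fix can be sketched as a linear probe that starts at the next candidate ID and skips occupied slots, so a wrapped counter never overwrites a live entry. This is an assumption-laden illustration: the table size, the entry layout, and the `example`-prefixed names are hypothetical, and the real `anpCommDbEntryAdd` may differ in signature and behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_COMM_DB_SIZE 8 /* small on purpose so the wrap is easy to see */

struct exampleCommDbEntry {
    bool inUse;
    void *comm;
};

static struct exampleCommDbEntry exampleCommDb[EXAMPLE_COMM_DB_SIZE];
static unsigned exampleNextCommId;

/* Scan for a free slot starting at the next candidate ID.
 * Returns the slot index used as the commId, or -1 if the table is full. */
static int exampleCommDbEntryAdd(void *comm) {
    for (unsigned probe = 0; probe < EXAMPLE_COMM_DB_SIZE; probe++) {
        unsigned id = (exampleNextCommId + probe) % EXAMPLE_COMM_DB_SIZE;
        if (!exampleCommDb[id].inUse) {
            exampleCommDb[id].inUse = true;
            exampleCommDb[id].comm = comm;
            exampleNextCommId = (id + 1) % EXAMPLE_COMM_DB_SIZE;
            return (int)id;
        }
    }
    return -1;
}

/* Release a slot so a later add can reuse it. */
static void exampleCommDbEntryRemove(int id) {
    assert(id >= 0 && id < EXAMPLE_COMM_DB_SIZE);
    exampleCommDb[id].inUse = false;
    exampleCommDb[id].comm = NULL;
}
```

In this sketch a full table is reported to the caller instead of silently reusing an ID, and a freed slot is picked up again on the next scan.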

(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)

Squashed from PR ROCm#15 (2 commits):
    - Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage
    - Fix accept-side QP sharing hash collision + add WARN-level decision tracing

    Cleanup and refactor comm grouping (PR#15 follow-up)

    Rename all QP sharing references to comm grouping mental model:
    - Structs: anpSharedQpKey -> anpCommGroupKey, anpSharedQpEntry -> anpCommGroup
    - Functions: anpRegisterSharedQp -> anpAddCommGroup, anpFindSharedQp -> anpFindCommGroup
    - Fields: sharedGroupIdx -> commGroupIdx, shareGroupId -> senderPid, peerListenId -> peerPid
    - Wire protocol: remove groupHash from ncclIbConnectionMetadata (redundant)

    Consolidate env variables from 4 to 2:
    - Remove NCCL_ANP_COMM_GROUPING (redundant, NGROUPS>0 implies enabled)
    - Remove NCCL_ANP_COMM_GROUPING_DISABLE_CTS (hardcode for grouped QPs)
    - Rename NCCL_ANP_COMM_GROUPING_NGROUPS -> NCCL_ANP_COMM_NGROUPS (default 0)
    - Rename NCCL_ANP_COMM_GROUPING_DEPTH -> NCCL_ANP_QP_DEPTH_MULTIPLIER (default 1)
    - Simplify queue depth from ceil(D/m) to direct multiplier

    Fix commId table collision on wrap (anpCommDbEntryAdd scans for free slot).
    Refactor CloseSend/CloseRecv to deduplicate per-device cleanup loop.
    Add TODO.md documenting open issues (use-after-free, wasteful CQ, etc).

Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)
sarat-k merged commit 5e15f51 into ROCm:main May 7, 2026
sarat-k deleted the shared-comm branch May 7, 2026 02:15
sarat-k added a commit that referenced this pull request May 7, 2026
* Plugin update (#14)

Compatible RCCL:
 https://github.com/ROCm/rccl/tree/develop (commit hash 420b3b840e0324ea897db7f04028471a4ea830d7)

 Pen amd-anp sha: 96ba08c5d900e286a5dc8d50cb84da4438adb662

(cherry picked from commit 1243259)

* Add N:m QP sharing for AMD ANP plugin (#105) (#16)

(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)

Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
(cherry picked from commit 5e15f51)

---------

Co-authored-by: Karthikeyan Arumugam <karthik@pensando.io>
Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>