Add N:m QP sharing for AMD ANP plugin (#105) #16
Merged
Squashed from PR ROCm#15 (2 commits):
- Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage
- Fix accept-side QP sharing hash collision + add WARN-level decision tracing

Cleanup and refactor comm grouping (PR#15 follow-up)

Rename all QP sharing references to comm grouping mental model:
- Structs: anpSharedQpKey -> anpCommGroupKey, anpSharedQpEntry -> anpCommGroup
- Functions: anpRegisterSharedQp -> anpAddCommGroup, anpFindSharedQp -> anpFindCommGroup
- Fields: sharedGroupIdx -> commGroupIdx, shareGroupId -> senderPid, peerListenId -> peerPid
- Wire protocol: remove groupHash from ncclIbConnectionMetadata (redundant)

Consolidate env variables from 4 to 2:
- Remove NCCL_ANP_COMM_GROUPING (redundant, NGROUPS>0 implies enabled)
- Remove NCCL_ANP_COMM_GROUPING_DISABLE_CTS (hardcode for grouped QPs)
- Rename NCCL_ANP_COMM_GROUPING_NGROUPS -> NCCL_ANP_COMM_NGROUPS (default 0)
- Rename NCCL_ANP_COMM_GROUPING_DEPTH -> NCCL_ANP_QP_DEPTH_MULTIPLIER (default 1)
- Simplify queue depth from ceil(D/m) to direct multiplier

Fix commId table collision on wrap (anpCommDbEntryAdd scans for free slot).
Refactor CloseSend/CloseRecv to deduplicate per-device cleanup loop.
Add TODO.md documenting open issues (use-after-free, wasteful CQ, etc).

Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
(cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)
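For reviewers who want a concrete picture of the consolidated env variables, here is a minimal sketch of how the two remaining knobs might be read and applied. Only the variable names and their defaults (`NCCL_ANP_COMM_NGROUPS` = 0, `NCCL_ANP_QP_DEPTH_MULTIPLIER` = 1) come from the description above. The helper names, the `anpGroupingConfig` struct, and the assumption that the multiplier scales a base depth only when grouping is enabled are hypothetical and may not match the plugin's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a non-negative integer env var, falling back to
 * the default when the variable is unset, empty, malformed, or out of range. */
static int envIntOrDefault(const char *name, int defaultValue) {
    const char *s = getenv(name);
    if (s == NULL || *s == '\0') return defaultValue;
    errno = 0;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > INT_MAX) return defaultValue;
    return (int)v;
}

/* Hypothetical config struct; the field names are illustrative only. */
typedef struct {
    int nGroups;         /* NCCL_ANP_COMM_NGROUPS, 0 means grouping is off */
    int depthMultiplier; /* NCCL_ANP_QP_DEPTH_MULTIPLIER, default 1 */
} anpGroupingConfig;

static anpGroupingConfig anpLoadGroupingConfig(void) {
    anpGroupingConfig c;
    c.nGroups = envIntOrDefault("NCCL_ANP_COMM_NGROUPS", 0);
    c.depthMultiplier = envIntOrDefault("NCCL_ANP_QP_DEPTH_MULTIPLIER", 1);
    if (c.depthMultiplier < 1) c.depthMultiplier = 1; /* treat 0 as "no scaling" */
    return c;
}

/* No separate on/off switch: NGROUPS > 0 implies grouping is enabled. */
static int anpGroupingEnabled(const anpGroupingConfig *c) {
    return c->nGroups > 0;
}

/* Assumed semantics of the "direct multiplier": grouped QPs get
 * baseDepth * multiplier, ungrouped QPs keep baseDepth unchanged. */
static int anpQueueDepth(const anpGroupingConfig *c, int baseDepth) {
    assert(baseDepth > 0);
    return anpGroupingEnabled(c) ? baseDepth * c->depthMultiplier : baseDepth;
}
```

Under these assumptions, leaving both variables unset keeps the pre-grouping behavior, and setting only `NCCL_ANP_COMM_NGROUPS` enables grouping without changing queue depth.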
sarat-k added a commit that referenced this pull request on May 7, 2026
* Plugin update (#14)

  Compatible RCCL: https://github.com/ROCm/rccl/tree/develop (commit hash 420b3b840e0324ea897db7f04028471a4ea830d7)
  Pen amd-anp sha: 96ba08c5d900e286a5dc8d50cb84da4438adb662

  (cherry picked from commit 1243259)

* Add N:m QP sharing for AMD ANP plugin (#105) (#16)

  Squashed from PR #15 (2 commits):
  - Add N:m QP sharing for AMD ANP plugin to reduce HCA QP resource usage
  - Fix accept-side QP sharing hash collision + add WARN-level decision tracing

  Cleanup and refactor comm grouping (PR#15 follow-up)

  Rename all QP sharing references to comm grouping mental model:
  - Structs: anpSharedQpKey -> anpCommGroupKey, anpSharedQpEntry -> anpCommGroup
  - Functions: anpRegisterSharedQp -> anpAddCommGroup, anpFindSharedQp -> anpFindCommGroup
  - Fields: sharedGroupIdx -> commGroupIdx, shareGroupId -> senderPid, peerListenId -> peerPid
  - Wire protocol: remove groupHash from ncclIbConnectionMetadata (redundant)

  Consolidate env variables from 4 to 2:
  - Remove NCCL_ANP_COMM_GROUPING (redundant, NGROUPS>0 implies enabled)
  - Remove NCCL_ANP_COMM_GROUPING_DISABLE_CTS (hardcode for grouped QPs)
  - Rename NCCL_ANP_COMM_GROUPING_NGROUPS -> NCCL_ANP_COMM_NGROUPS (default 0)
  - Rename NCCL_ANP_COMM_GROUPING_DEPTH -> NCCL_ANP_QP_DEPTH_MULTIPLIER (default 1)
  - Simplify queue depth from ceil(D/m) to direct multiplier

  Fix commId table collision on wrap (anpCommDbEntryAdd scans for free slot).
  Refactor CloseSend/CloseRecv to deduplicate per-device cleanup loop.
  Add TODO.md documenting open issues (use-after-free, wasteful CQ, etc).

  (cherry picked from commit 6cb64982e80e955c8d23daa457b6c8e2e60aa3ff)

  Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
  (cherry picked from commit 5e15f51)

---------

Co-authored-by: Karthikeyan Arumugam <karthik@pensando.io>
Co-authored-by: Sarat Kamisetty <sakamiset@ainic16-headnode.prov.aus.ccs.cpe.ice.amd.com>
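The "commId table collision on wrap" fix is described only as `anpCommDbEntryAdd` scanning for a free slot. As a rough illustration of that idea, the sketch below shows a fixed-size table whose allocator probes forward from a rolling cursor and skips slots that are still in use, instead of blindly reusing the wrapped index. The table size, struct layout, and the `commDbAdd`/`commDbRemove` names are hypothetical and are not taken from the plugin source.

```c
#include <stdbool.h>
#include <stddef.h>

#define COMM_DB_SIZE 8 /* hypothetical table size, kept small for illustration */

typedef struct {
    bool inUse;
    void *comm; /* opaque handle stored in the slot */
} commDbEntry;

typedef struct {
    commDbEntry entries[COMM_DB_SIZE];
    unsigned next; /* rolling cursor: where the next scan starts */
} commDb;

/* Scan at most COMM_DB_SIZE slots starting at the cursor and claim the first
 * free one. Returns the slot index used as the id, or -1 if the table is full.
 * A long-lived entry is never overwritten when the cursor wraps around. */
static int commDbAdd(commDb *db, void *comm) {
    for (unsigned i = 0; i < COMM_DB_SIZE; i++) {
        unsigned slot = (db->next + i) % COMM_DB_SIZE;
        if (!db->entries[slot].inUse) {
            db->entries[slot].inUse = true;
            db->entries[slot].comm = comm;
            db->next = (slot + 1) % COMM_DB_SIZE;
            return (int)slot;
        }
    }
    return -1;
}

/* Release a slot so a later add can reuse it; out-of-range ids are ignored. */
static void commDbRemove(commDb *db, int id) {
    if (id < 0 || id >= COMM_DB_SIZE) return;
    db->entries[id].inUse = false;
    db->entries[id].comm = NULL;
}
```

The linear probe costs O(table size) in the worst case, which seems an acceptable trade for a table that is only touched at connection setup and teardown.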