Torch integration #692
Conversation
include/mscclpp/algorithm.hpp
Outdated
std::vector<RegisteredMemory> registeredMemories;
std::vector<MemoryChannel> memoryChannels;
std::vector<SwitchChannel> switchChannels;
std::vector<PortChannel> portChannels;
std::vector<std::shared_ptr<NvlsConnection>> nvlsConnections;
std::shared_ptr<DeviceHandle<MemoryChannel>> memoryChannelDeviceHandles;
std::shared_ptr<DeviceHandle<SwitchChannel>> switchChannelDeviceHandles;
std::shared_ptr<DeviceHandle<PortChannel>> portChannelDeviceHandles;
std::vector<std::shared_ptr<MemoryDevice2DeviceSemaphore>> memorySemaphores;
std::vector<std::shared_ptr<Host2DeviceSemaphore>> hostSemaphores;
std::unordered_map<std::string, std::shared_ptr<void>> extras;
Do these need to be public? Note this is a user-facing header.
Users need this to implement customized algorithms, for example: https://github.com/microsoft/mscclpp/blob/105239fc6c8c246e77a6f54504cb2836a67ca234/examples/customized-collective-algorithm/customized_allgather.cu
Also, this structure is not stable; in the future it will change to use BaseChannel instead of Channel with memory attached.
Move to the internal header folder
chhwang left a comment
Putting all my minor comments below.
examples/torch-integration/customized_allgather.cu: "MIT license" -> "MIT License"
python/csrc/ext/algorithm_collection_builder.cpp: rename the file into python/csrc/ext/algorithm_collection_builder_py.cpp
python/mscclpp/_core/__init__.py: "MIT license" -> "MIT License"
python/mscclpp/_core/buffer.py: "MIT license" -> "MIT License"
include/mscclpp/ext/collectives/algorithm_collection_builder.hpp: missing license
src/core/include/logger.hpp: let the LogSubsys::COUNT case fall back to the default (return "UNKNOWN"), since COUNT exists only for counting the number of LogSubsys types and using it directly for logging is not valid behavior. Let's put a comment to that effect here, since Copilot keeps suggesting handling it explicitly.
src/core/algorithm.cc: "MIT license" -> "MIT License"
src/core/CMakeLists.txt: "MIT license" -> "MIT License"
src/core/gpu_utils.cc: fix the comment "!defined(HIP_PLATFORM_AMD)" -> "!defined(MSCCLPP_USE_ROCM)"
src/ext/collectives/allgather/allgather_fullmesh2.cu: rename the file to src/ext/collectives/allgather/allgather_fullmesh_2.cu (it would be better to give it a proper name rather than "_2")
src/ext/collectives/allreduce/allreduce_nvls_with_copy2.cu: rename the file to src/ext/collectives/allreduce/allreduce_nvls_with_copy_2.cu (it would be better to give it a proper name rather than "_2")
src/ext/collectives/include/allgather/allgather_fullmesh.hpp: missing include guard in this header file - use #ifndef MSCCLPP_EXT_<FILE_NAME>_HPP
src/ext/collectives/include/allgather/allgather_fullmesh2.hpp: rename the file to src/ext/collectives/include/allgather/allgather_fullmesh_2.hpp (it would be better to give it a proper name rather than "_2"). Also, missing include guard in this header file - use #ifndef MSCCLPP_EXT_<FILE_NAME>_HPP
src/ext/collectives/include/allreduce/allreduce_allpair_packet.hpp: "MIT license" -> "MIT License"
src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp: "MIT license" -> "MIT License"
src/ext/collectives/include/allreduce/allreduce_nvls_with_copy2.hpp: rename the file into src/ext/collectives/include/allreduce/allreduce_nvls_with_copy_2.hpp (would be better if we give a proper name rather than "_2")
src/ext/collectives/include/collective_utils.hpp: follow the #ifndef MSCCLPP_EXT_<FILE_NAME>_HPP format for the header guard
src/ext/collectives/algorithm_collection_builder.cc: missing license
src/ext/collectives/CMakeLists.txt: "MIT license" -> "MIT License"
src/ext/nccl/CMakeLists.txt: "MIT license" -> "MIT License"
src/ext/CMakeLists.txt: missing license
include/mscclpp/algorithm.hpp: the enum value names of CommResult (e.g., commSuccess) are inconsistent with the other enums. CAPITAL_CASE would be too long here, so I'd recommend CamelCase, both for CommResult and for the other enums in this header.
Overall: review the documentation under docs/ to check that all the text is still aligned with the changed paths. In particular, we need to update the binary paths (such as the mp_unit_tests build path) mentioned in the documentation.
@copilot work on this
/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).
…d style consistency (#725)

- [x] Fix license text: "MIT license" → "MIT License" in multiple files
- [x] Rename files with "_2" suffix and update references
- [x] Add missing license headers
- [x] Fix header guards to follow MSCCLPP_EXT_<FILE_NAME>_HPP_ format
- [x] Fix enum naming consistency
  - [x] CommResult enum to CamelCase
  - [x] CollectiveBufferMode enum to CamelCase (Any, InPlace, OutOfPlace)
  - [x] AlgorithmType enum to CamelCase (Native, DSL)
- [x] Fix comment in src/core/gpu_utils.cc
- [x] Fix LogSubsys::COUNT case in src/core/include/logger.hpp
  - [x] Add explanatory comment
  - [x] Add [[fallthrough]] attribute
- [x] Apply clang-format
- [x] Remove _codeql_detected_source_root file and add to .gitignore
- [x] Update documentation paths
  - [x] Fix NCCL library paths: build/apps/nccl/ → build/lib/
  - [x] Fix test binary paths: ./test/ → ./bin/

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).

/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).

/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).

/azp run

Azure Pipelines successfully started running 3 pipeline(s).

/azp run

Azure Pipelines successfully started running 3 pipeline(s).

/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).
Reorganize the current native algorithm implementation and the DSL algorithm implementation.
Provide a unified API for DSL and native algorithms, and provide an interface to tune the algorithms.
Provide an interface for PyTorch integration with the native API and the DSL.