Torch integration #692
Conversation
include/mscclpp/algorithm.hpp
Outdated
std::vector<RegisteredMemory> registeredMemories;
std::vector<MemoryChannel> memoryChannels;
std::vector<SwitchChannel> switchChannels;
std::vector<PortChannel> portChannels;
std::vector<std::shared_ptr<NvlsConnection>> nvlsConnections;
std::shared_ptr<DeviceHandle<MemoryChannel>> memoryChannelDeviceHandles;
std::shared_ptr<DeviceHandle<SwitchChannel>> switchChannelDeviceHandles;
std::shared_ptr<DeviceHandle<PortChannel>> portChannelDeviceHandles;
std::vector<std::shared_ptr<MemoryDevice2DeviceSemaphore>> memorySemaphores;
std::vector<std::shared_ptr<Host2DeviceSemaphore>> hostSemaphores;
std::unordered_map<std::string, std::shared_ptr<void>> extras;
Do these need to be public? Note this is a user-facing header.
Users need this to implement customized algorithms, for example: https://github.com/microsoft/mscclpp/blob/105239fc6c8c246e77a6f54504cb2836a67ca234/examples/customized-collective-algorithm/customized_allgather.cu
Also, this structure is not stable; in the future it will change to use BaseChannel instead of Channel with memory attached.
Move to the internal header folder
chhwang left a comment
Putting all my minor comments below.
examples/torch-integration/customized_allgather.cu: "MIT license" -> "MIT License"
python/csrc/ext/algorithm_collection_builder.cpp: rename the file into python/csrc/ext/algorithm_collection_builder_py.cpp
python/mscclpp/_core/__init__.py: "MIT license" -> "MIT License"
python/mscclpp/_core/buffer.py: "MIT license" -> "MIT License"
include/mscclpp/ext/collectives/algorithm_collection_builder.hpp: missing license
src/core/include/logger.hpp: let the LogSubsys::COUNT case fall back to the default (return "UNKNOWN"), since COUNT exists only for counting the number of LogSubsys types and using it directly for logging is not valid behavior. Let's put a comment to that effect here, since Copilot keeps suggesting handling it explicitly.
src/core/algorithm.cc: "MIT license" -> "MIT License"
src/core/CMakeLists.txt: "MIT license" -> "MIT License"
src/core/gpu_utils.cc: fix the comment "!defined(HIP_PLATFORM_AMD)" -> "!defined(MSCCLPP_USE_ROCM)"
src/ext/collectives/allgather/allgather_fullmesh2.cu: rename the file to src/ext/collectives/allgather/allgather_fullmesh_2.cu (it would be better to give it a proper name rather than "_2")
src/ext/collectives/allreduce/allreduce_nvls_with_copy2.cu: rename the file to src/ext/collectives/allreduce/allreduce_nvls_with_copy_2.cu (it would be better to give it a proper name rather than "_2")
src/ext/collectives/include/allgather/allgather_fullmesh.hpp: missing include guard in this header file - use #ifndef MSCCLPP_EXT_<FILE_NAME>_HPP
src/ext/collectives/include/allgather/allgather_fullmesh2.hpp: rename the file to src/ext/collectives/include/allgather/allgather_fullmesh_2.hpp (it would be better to give it a proper name rather than "_2"). Also, missing include guard in this header file - use #ifndef MSCCLPP_EXT_<FILE_NAME>_HPP
src/ext/collectives/include/allreduce/allreduce_allpair_packet.hpp: "MIT license" -> "MIT License"
src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp: "MIT license" -> "MIT License"
src/ext/collectives/include/allreduce/allreduce_nvls_with_copy2.hpp: rename the file into src/ext/collectives/include/allreduce/allreduce_nvls_with_copy_2.hpp (would be better if we give a proper name rather than "_2")
src/ext/collectives/include/collective_utils.hpp: follow the #ifndef MSCCLPP_EXT_<FILE_NAME>_HPP format for the header guard
src/ext/collectives/algorithm_collection_builder.cc: missing license
src/ext/collectives/CMakeLists.txt: "MIT license" -> "MIT License"
src/ext/nccl/CMakeLists.txt: "MIT license" -> "MIT License"
src/ext/CMakeLists.txt: missing license
include/mscclpp/algorithm.hpp: the enum value names of CommResult (e.g., commSuccess) are inconsistent with the other enums. CAPITAL_CASE would be too long here, so I'd recommend CamelCase, both for CommResult and for the other enums in this header.
Overall: review the documentation under docs/ to check that all the text is still aligned with the changed paths. In particular, we need to update the binary paths (such as the mp_unit_tests build path) mentioned in the documentation.
@copilot work on this
/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).
…d style consistency (#725)

- [x] Fix license text: "MIT license" → "MIT License" in multiple files
- [x] Rename files with "_2" suffix and update references
- [x] Add missing license headers
- [x] Fix header guards to follow MSCCLPP_EXT_<FILE_NAME>_HPP_ format
- [x] Fix enum naming consistency
  - [x] CommResult enum to CamelCase
  - [x] CollectiveBufferMode enum to CamelCase (Any, InPlace, OutOfPlace)
  - [x] AlgorithmType enum to CamelCase (Native, DSL)
- [x] Fix comment in src/core/gpu_utils.cc
- [x] Fix LogSubsys::COUNT case in src/core/include/logger.hpp
  - [x] Add explanatory comment
  - [x] Add [[fallthrough]] attribute
- [x] Apply clang-format
- [x] Remove _codeql_detected_source_root file and add to .gitignore
- [x] Update documentation paths
  - [x] Fix NCCL library paths: build/apps/nccl/ → build/lib/
  - [x] Fix test binary paths: ./test/ → ./bin/

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).

/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).

/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).

/azp run

Azure Pipelines successfully started running 3 pipeline(s).

/azp run

Azure Pipelines successfully started running 3 pipeline(s).

/azp run mscclpp-ut

Azure Pipelines successfully started running 1 pipeline(s).
Reorganize the current native algorithm implementation and the DSL algorithm implementation.
Provide a unified API for DSL and native algorithms, and provide an interface to tune the algorithms.
Provide an interface for PyTorch integration with the native API and the DSL.