Conversation

@mahdiehghazim

This PR improves the performance of MSCCL++ on GB200. We also need to update the quick start guide to note that this option must be added to the CMake command when compiling for GB200:

-DMSCCLPP_GPU_ARCHS=100
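
For reference, the updated quick start build step could look roughly like this; only -DMSCCLPP_GPU_ARCHS=100 comes from this PR, and the surrounding commands are just the usual out-of-source CMake build, shown for illustration:

mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DMSCCLPP_GPU_ARCHS=100 ..
make -j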

Copilot AI (Contributor) left a comment


Pull request overview

This PR optimizes MSCCL++ allreduce operations for GB200 (NVIDIA's next-generation GPU with compute capability 10.0) by adjusting block counts, switch channels, and memory alignment parameters specifically for this architecture. The changes enable better performance on GB200 systems by leveraging architecture-specific optimizations in NVLS (NVLink Switch) operations.

Key changes include:

  • Increased block count from 8 to 24 for GB200 in NVLS allreduce operations
  • Increased switch channel count from 8 to 24 for GB200 during initialization
  • Added 16-byte alignment for block size calculations in device code for GB200

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File descriptions:

  • apps/nccl/src/allreduce.hpp: Adds GB200-specific device-side optimizations with 16-byte block size alignment and adjusted last-block size handling (sketched below)
  • apps/nccl/src/allreduce.cu: Implements runtime detection of GB200 to configure the optimal block count (24) and switch channel count (24) versus the default value (8)
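
To illustrate the allreduce.hpp change summarized above, here is a minimal sketch of how the device-side block split could look on GB200 (the helper function and the exact last-block handling are assumptions; variable names follow the snippet quoted later in this review):

__device__ void splitPerRankSlice(size_t sizePerRank, int nBlocks, size_t& sizePerBlock,
                                  size_t& lastBlockSize) {
#if __CUDA_ARCH__ >= 1000
  // On GB200, round each block's share of the per-rank slice down to a 16-byte boundary.
  sizePerBlock = (sizePerRank / nBlocks) / 16 * 16;
#else
  sizePerBlock = sizePerRank / nBlocks;
#endif
  // The last block picks up the remainder so the whole per-rank slice is still covered.
  lastBlockSize = sizePerRank - sizePerBlock * (nBlocks - 1);
}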


int bid = blockIdx.x;
size_t sizePerRank = size / nRanksPerNode;
#if __CUDA_ARCH__ >= 1000
size_t sizePerBlock = (sizePerRank / nBlocks) / 16 * 16;

Copilot AI Dec 17, 2025


The magic number 16 is used for alignment without explanation. Consider adding a comment explaining that this is for 16-byte alignment optimization on GB200, or defining it as a named constant to improve code readability and maintainability.
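
One way this suggestion could be applied (the constant name and the comment wording are illustrative, not from the PR):

// Align each block's share of the per-rank slice to a 16-byte boundary on GB200
// (compute capability 10.0 and newer).
constexpr size_t kBlockSizeAlignBytes = 16;
size_t sizePerBlock = (sizePerRank / nBlocks) / kBlockSizeAlignBytes * kBlockSizeAlignBytes;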

- int nBlocks = nRanksPerNode;
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+ int nBlocks = (prop.major == 10) ? 24 : nRanksPerNode;

Copilot AI Dec 17, 2025


The magic number 24 for nBlocks on GB200 lacks documentation. Consider adding a comment explaining why 24 blocks are optimal for compute capability 10.0, or referencing relevant documentation or performance tuning results.
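
Similarly, one way this suggestion could be applied to the snippet above (the constant name and the comment wording are illustrative, not from the PR):

// GB200 (compute capability 10.0): 24 thread blocks give better NVLS allreduce
// performance than the default of one block per rank.
constexpr int kGb200NvlsBlockCount = 24;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int nBlocks = (prop.major == 10) ? kGb200NvlsBlockCount : nRanksPerNode;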

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

@Binyang2014 (Contributor) left a comment


Let's merge this PR after #692, which will make it easier to select and extend algorithms.

@Binyang2014 (Contributor) left a comment


Also, I think we can get better performance if we introduce loop unrolling; please refer to this code:

#pragma unroll
for (int step = 0; step < nStepsPerIter * REDUCE_COPY_RATIO; step++) {
  uint32_t offset = srcOffset + threadIdInPut + step * blockDim.x * nblocksForPut;
  tmp[step] = loadPacket(buff, offset, nelems);
}

Issuing multiple instructions in flight should improve performance. Not sure if it is helpful for NVLS.
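
As a self-contained illustration of the pattern being suggested, here is a generic sketch of an unrolled load-then-accumulate loop (the function, the UNROLL factor, and the float4 element type are illustrative, not taken from MSCCL++):

__device__ void unrolledLoadAccumulate(const float4* __restrict__ src, float4* __restrict__ dst,
                                       int idx, int stride) {
  constexpr int UNROLL = 4;
  float4 tmp[UNROLL];
#pragma unroll
  for (int u = 0; u < UNROLL; ++u) {
    // Issue all loads first so several memory requests are in flight concurrently.
    tmp[u] = src[idx + u * stride];
  }
#pragma unroll
  for (int u = 0; u < UNROLL; ++u) {
    // Accumulate once the loads have been issued.
    float4 d = dst[idx + u * stride];
    d.x += tmp[u].x; d.y += tmp[u].y; d.z += tmp[u].z; d.w += tmp[u].w;
    dst[idx + u * stride] = d;
  }
}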
