Mahdieh/gb200 nvloptimized #708
base: main
Conversation
Pull request overview
This PR optimizes MSCCL++ allreduce operations for GB200 (NVIDIA's next-generation GPU with compute capability 10.0) by adjusting block counts, switch channels, and memory alignment parameters specifically for this architecture. The changes enable better performance on GB200 systems by leveraging architecture-specific optimizations in NVLS (NVLink Switch) operations.
Key changes include:
- Increased block count from 8 to 24 for GB200 in NVLS allreduce operations
- Increased switch channel count from 8 to 24 for GB200 during initialization
- Added 16-byte alignment for block size calculations in device code for GB200
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| apps/nccl/src/allreduce.hpp | Adds GB200-specific device-side optimizations with 16-byte block size alignment and adjusted last-block size handling (see the sketch below the table) |
| apps/nccl/src/allreduce.cu | Implements runtime detection of GB200 to configure optimal block counts (24) and switch channels (24) versus default values (8) |
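To make the device-side change concrete, here is a minimal sketch of 16-byte-aligned per-block partitioning with last-block remainder handling. The function and variable names are assumptions for illustration only, not the code in this PR:

```cpp
// Hypothetical sketch: split each rank's span into per-block chunks whose size
// is rounded down to a 16-byte multiple on GB200 (compute capability 10.0);
// the last block absorbs the remainder so the full range is still covered.
__device__ void partitionForBlock(size_t sizePerRank, int nBlocks, int bid,
                                  size_t* blockOffset, size_t* blockSize) {
#if __CUDA_ARCH__ >= 1000
  size_t sizePerBlock = (sizePerRank / nBlocks) / 16 * 16;  // 16-byte aligned chunk
#else
  size_t sizePerBlock = sizePerRank / nBlocks;
#endif
  *blockOffset = bid * sizePerBlock;
  // Last block picks up whatever the alignment rounding left over.
  *blockSize = (bid == nBlocks - 1) ? sizePerRank - *blockOffset : sizePerBlock;
}
```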
```cpp
int bid = blockIdx.x;
size_t sizePerRank = size / nRanksPerNode;
#if __CUDA_ARCH__ >= 1000
size_t sizePerBlock = (sizePerRank / nBlocks) / 16 * 16;
```
Copilot AI commented on Dec 17, 2025
The magic number 16 is used for alignment without explanation. Consider adding a comment explaining that this is for 16-byte alignment optimization on GB200, or defining it as a named constant to improve code readability and maintainability.
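A hedged sketch of what that suggestion could look like; the constant name is invented for illustration and is not an identifier from this PR:

```cpp
// Hypothetical: name the alignment instead of repeating a bare 16.
constexpr size_t kGb200BlockAlignBytes = 16;  // 16-byte alignment for the GB200 path

// ... inside the kernel, mirroring the excerpt above:
size_t sizePerBlock = (sizePerRank / nBlocks) / kGb200BlockAlignBytes * kGb200BlockAlignBytes;
```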
```diff
- int nBlocks = nRanksPerNode;
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+ int nBlocks = (prop.major == 10) ? 24 : nRanksPerNode;
```
Copilot AI commented on Dec 17, 2025
The magic number 24 for nBlocks on GB200 lacks documentation. Consider adding a comment explaining why 24 blocks are optimal for compute capability 10.0, or referencing relevant documentation or performance tuning results.
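For illustration only, one possible shape for that suggestion, with an assumed constant name; the value 24 comes from this PR, while the rationale comment would come from the author's tuning results:

```cpp
// Hypothetical: document the GB200 tuning value with a named constant.
constexpr int kGb200NumBlocks = 24;  // block count tuned for compute capability 10.0 (GB200)

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int nBlocks = (prop.major == 10) ? kGb200NumBlocks : nRanksPerNode;
```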
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Binyang2014 left a comment
Let's merge this PR after #692, which will make it easier to select and extend algorithms.
Binyang2014 left a comment
Also, I think we can get better performance if we introduce loop unrolling; please refer to this code:
mscclpp/src/algorithms/allreduce/allreduce_rsag_pipeline.cu
Lines 93 to 97 in ef6bb8a
```cpp
#pragma unroll
for (int step = 0; step < nStepsPerIter * REDUCE_COPY_RATIO; step++) {
  uint32_t offset = srcOffset + threadIdInPut + step * blockDim.x * nblocksForPut;
  tmp[step] = loadPacket(buff, offset, nelems);
}
```
This PR improves the performance of MSCCL++ on GB200. We also need to update the quick start guide to note that the following option must be added to the CMake command when compiling for GB200:
-DMSCCLPP_GPU_ARCHS=100