feat(megatron): add nccl_comm_warmup to avoid iteration-1 NCCL cudaMalloc OOM (#6387) by yuchenwang3 · Pull Request #9602 · modelscope/ms-swift

yuchenwang3 · 2026-06-18T08:37:48Z

What

Adds an opt-in nccl_comm_warmup flag that eagerly creates the NCCL communicators before the training loop, to avoid an iteration-1 OOM.

Why

NCCL communicators are created lazily on first use. For the per-step collectives that ms-swift/Megatron issue inside the training step (e.g. the dp/cp loss all-reduce and grad-sync coalescing), that first use happens at the iteration-1 memory peak. At that point NCCL's internal cudaMalloc for the communicator buffers can fail with:

Failed to CUDA calloc async N bytes

so training dies on the very first step on large models / tight-memory configs. This is the error reported in #6387 (Qwen3-Next-80B-A3B, 8×H20), which was only worked around by changing the parallel layout (--expert_model_parallel_size 8) rather than fixed in code. We hit the same failure on Qwen3.5-35B-A3B.

Change

When --nccl_comm_warmup true is set, setup_model_training fires one numerically inert 1-element all_reduce on each parallel group (dp, dp+cp, cp, tp, pp, model, embedding, position-embedding) right before on_train_begin. This forces the communicators to be allocated while memory is still free; groups that don't exist for the current parallel config are skipped. The default is False, so behavior is unchanged unless the flag is explicitly enabled.
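The warmup loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: warm_up_groups, the stand-in getters, and the injected all_reduce callable are hypothetical names chosen so the sketch runs without GPUs, and it assumes that a getter signals a missing group either by raising or by returning None.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# (getter, kwargs) pairs, mirroring the shape of the tuple iterated in the diff.
GroupSpec = Tuple[Callable[..., Any], Dict[str, Any]]


def warm_up_groups(specs: Sequence[GroupSpec],
                   all_reduce: Callable[[Any], None]) -> List[Any]:
    """Issue one tiny collective per existing group; return the groups touched."""
    warmed: List[Any] = []
    for getter, kwargs in specs:
        try:
            group = getter(**kwargs)
        except (AssertionError, RuntimeError):
            # Assumption: the getter raises when the group was never built
            # for the current parallel layout, so it is skipped.
            continue
        if group is None:
            continue
        # In a real job this callable would be something like:
        #   lambda g: torch.distributed.all_reduce(
        #       torch.zeros(1, device="cuda"), group=g)
        all_reduce(group)
        warmed.append(group)
    return warmed


# Stand-in getters (hypothetical) so the sketch runs on a single CPU process.
def get_dp_group(with_context_parallel: bool = False) -> str:
    return "dp+cp" if with_context_parallel else "dp"


def get_embedding_group() -> str:
    raise AssertionError("embedding group is not initialized")


calls: List[str] = []
warmed = warm_up_groups(
    [
        (get_dp_group, {"with_context_parallel": True}),
        (get_dp_group, {}),
        (get_embedding_group, {}),
    ],
    all_reduce=calls.append,
)
print(warmed)  # the embedding group is skipped because its getter raised
```

Passing the collective in as a callable keeps the skip logic testable without a distributed backend; the actual change calls the real all_reduce directly on each group.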

Notes

Opt-in and default-off; zero effect on existing runs.
The dummy all-reduce is a no-op numerically (a 1-element zero tensor).
Verified that the two changed files compile; could not run a multi-GPU job locally, so this relies on CI and on users who hit #6387 (Qwen3-Next-80B-A3B-Instruct, 8×H20, LoRA fine-tuning fails with OOM: Failed to CUDA calloc async 136 bytes).

gemini-code-assist

Code Review

This pull request introduces a new configuration option nccl_comm_warmup to eagerly initialize NCCL communicators before the training loop starts, preventing potential out-of-memory errors during lazy allocation at peak memory usage. The review feedback suggests extracting the list of group getters into a named constant to improve code readability and maintainability.

gemini-code-assist · 2026-06-18T08:40:30Z

+            for getter, kwargs in (
+                (mpu.get_data_parallel_group, {'with_context_parallel': True}),
+                (mpu.get_data_parallel_group, {}),
+                (mpu.get_context_parallel_group, {}),
+                (mpu.get_tensor_model_parallel_group, {}),
+                (mpu.get_pipeline_model_parallel_group, {}),
+                (mpu.get_model_parallel_group, {}),
+                (mpu.get_embedding_group, {}),
+                (mpu.get_position_embedding_group, {}),
+            ):


For improved readability and maintainability, consider extracting this list of group getters into a named constant. This makes the purpose of the list clearer and simplifies future modifications. You can then move the constant definition to the module level if you prefer.

# List of (getter, kwargs) for all communicators to warm up.
GROUPS_TO_WARM_UP = [
    (mpu.get_data_parallel_group, {'with_context_parallel': True}),
    (mpu.get_data_parallel_group, {}),
    (mpu.get_context_parallel_group, {}),
    (mpu.get_tensor_model_parallel_group, {}),
    (mpu.get_pipeline_model_parallel_group, {}),
    (mpu.get_model_parallel_group, {}),
    (mpu.get_embedding_group, {}),
    (mpu.get_position_embedding_group, {}),
]
for getter, kwargs in GROUPS_TO_WARM_UP:

…lloc OOM

NCCL communicators are created lazily; the per-step collectives' first use lands at the iteration-1 memory peak, where NCCL's internal cudaMalloc can fail with 'Failed to CUDA calloc async N bytes' and kill training on step 1 (see modelscope#6387). Add an opt-in nccl_comm_warmup flag that fires an inert 1-element all_reduce on each parallel group before the loop, forcing the communicators to be created while memory is still free. Default off.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

yuchenwang3 mentioned this pull request Jun 18, 2026

Qwen3-Next-80B-A3B-Instruct, 8×H20, LoRA fine-tuning fails with OOM: Failed to CUDA calloc async 136 bytes #6387

Closed

gemini-code-assist Bot reviewed Jun 18, 2026

yuchenwang3 force-pushed the feat/nccl-comm-warmup branch from 1a0d1ec to 83197c5 Compare June 18, 2026 16:27

yuchenwang3 wants to merge 1 commit into modelscope:main from yuchenwang3:feat/nccl-comm-warmup
