feat(megatron): add nccl_comm_warmup to avoid iteration-1 NCCL cudaMalloc OOM (#6387) #9602

Open: yuchenwang3 wants to merge 1 commit into modelscope:main from yuchenwang3:feat/nccl-comm-warmup

@yuchenwang3

What

Adds an opt-in nccl_comm_warmup flag that eagerly creates the NCCL communicators before the training loop, to avoid an iteration-1 OOM.

Why

NCCL communicators are created lazily on first use. For the per-step collectives that ms-swift/Megatron issue inside the training step (e.g. the dp/cp loss all-reduce and grad-sync coalescing), that first use happens at the iteration-1 memory peak. At that point NCCL's internal cudaMalloc for the communicator buffers can fail with:

Failed to CUDA calloc async N bytes

Training then dies on the very first step for large models or tight-memory configs. This is the error reported in #6387 (Qwen3-Next-80B-A3B, 8×H20), which was only worked around by changing the parallel layout (--expert_model_parallel_size 8) rather than fixed in code. We hit the same failure on Qwen3.5-35B-A3B.

Change

When --nccl_comm_warmup true is set, one numerically inert 1-element all_reduce is fired on each parallel group (dp, dp+cp, cp, tp, pp, model, embedding, position-embedding) right before on_train_begin (in setup_model_training). This forces the communicators to be allocated while memory is still free. Groups that don't exist for the current parallel config are skipped. The default is False, so behavior is unchanged unless the flag is explicitly enabled.
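The warmup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `warm_up_communicators`, `get_dp_group`, and `get_missing_group` are hypothetical names, the collective is passed in as a callable so the sketch runs without GPUs, and it assumes that a getter for a non-existent group either raises or returns None.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# A (getter, kwargs) pair, mirroring the tuples iterated over in the diff.
GroupSpec = Tuple[Callable[..., Any], Dict[str, Any]]


def warm_up_communicators(
    enabled: bool,
    group_specs: Iterable[GroupSpec],
    all_reduce: Callable[[Any], None],
) -> List[Any]:
    """Issue one collective per available group; return the groups touched."""
    if not enabled:  # default-off: nothing happens unless the flag is set
        return []
    warmed: List[Any] = []
    for getter, kwargs in group_specs:
        try:
            group = getter(**kwargs)
        except (AssertionError, RuntimeError):
            # Assumption: a getter signals "group not initialized" by raising.
            continue
        if group is None:
            continue
        all_reduce(group)  # first use forces the communicator to be created
        warmed.append(group)
    return warmed


# Hypothetical stand-ins for the mpu getters, so the sketch runs on CPU.
def get_dp_group(with_context_parallel: bool = False) -> str:
    return "dp+cp" if with_context_parallel else "dp"


def get_missing_group() -> str:
    raise AssertionError("group is not initialized")


calls: List[str] = []
specs = [
    (get_dp_group, {"with_context_parallel": True}),
    (get_dp_group, {}),
    (get_missing_group, {}),
]
warmed = warm_up_communicators(True, specs, calls.append)
skipped = warm_up_communicators(False, specs, calls.append)
```

In a real run the `all_reduce` argument would presumably be something like `lambda g: torch.distributed.all_reduce(torch.zeros(1, device="cuda"), group=g)`, where summing a zero tensor leaves the value unchanged.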

Notes

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new configuration option nccl_comm_warmup to eagerly initialize NCCL communicators before the training loop starts, preventing potential out-of-memory errors during lazy allocation at peak memory usage. The review feedback suggests extracting the list of group getters into a named constant to improve code readability and maintainability.

Comment on lines +626 to +635
for getter, kwargs in (
    (mpu.get_data_parallel_group, {'with_context_parallel': True}),
    (mpu.get_data_parallel_group, {}),
    (mpu.get_context_parallel_group, {}),
    (mpu.get_tensor_model_parallel_group, {}),
    (mpu.get_pipeline_model_parallel_group, {}),
    (mpu.get_model_parallel_group, {}),
    (mpu.get_embedding_group, {}),
    (mpu.get_position_embedding_group, {}),
):

medium

For improved readability and maintainability, consider extracting this list of group getters into a named constant. This makes the purpose of the list clearer and simplifies future modifications. You can then move the constant definition to the module level if you prefer.

            # List of (getter, kwargs) for all communicators to warm up.
            GROUPS_TO_WARM_UP = [
                (mpu.get_data_parallel_group, {'with_context_parallel': True}),
                (mpu.get_data_parallel_group, {}),
                (mpu.get_context_parallel_group, {}),
                (mpu.get_tensor_model_parallel_group, {}),
                (mpu.get_pipeline_model_parallel_group, {}),
                (mpu.get_model_parallel_group, {}),
                (mpu.get_embedding_group, {}),
                (mpu.get_position_embedding_group, {}),
            ]
            for getter, kwargs in GROUPS_TO_WARM_UP:

…lloc OOM

NCCL communicators are created lazily; the per-step collectives' first use
lands at the iteration-1 memory peak, where NCCL's internal cudaMalloc can
fail with 'Failed to CUDA calloc async N bytes' and kill training on step 1
(see modelscope#6387). Add an opt-in nccl_comm_warmup flag that fires an inert
1-element all_reduce on each parallel group before the loop, forcing the
communicators to be created while memory is still free. Default off.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
yuchenwang3 force-pushed the feat/nccl-comm-warmup branch from 1a0d1ec to 83197c5 on June 18, 2026 16:27