feat(megatron): add nccl_comm_warmup to avoid iteration-1 NCCL cudaMalloc OOM (#6387) #9602
yuchenwang3 wants to merge 1 commit into
Code Review
This pull request introduces a new configuration option nccl_comm_warmup to eagerly initialize NCCL communicators before the training loop starts, preventing potential out-of-memory errors during lazy allocation at peak memory usage. The review feedback suggests extracting the list of group getters into a named constant to improve code readability and maintainability.
for getter, kwargs in (
    (mpu.get_data_parallel_group, {'with_context_parallel': True}),
    (mpu.get_data_parallel_group, {}),
    (mpu.get_context_parallel_group, {}),
    (mpu.get_tensor_model_parallel_group, {}),
    (mpu.get_pipeline_model_parallel_group, {}),
    (mpu.get_model_parallel_group, {}),
    (mpu.get_embedding_group, {}),
    (mpu.get_position_embedding_group, {}),
):
For improved readability and maintainability, consider extracting this list of group getters into a named constant. This makes the purpose of the list clearer and simplifies future modifications. You can then move the constant definition to the module level if you prefer.
# List of (getter, kwargs) for all communicators to warm up.
GROUPS_TO_WARM_UP = [
    (mpu.get_data_parallel_group, {'with_context_parallel': True}),
    (mpu.get_data_parallel_group, {}),
    (mpu.get_context_parallel_group, {}),
    (mpu.get_tensor_model_parallel_group, {}),
    (mpu.get_pipeline_model_parallel_group, {}),
    (mpu.get_model_parallel_group, {}),
    (mpu.get_embedding_group, {}),
    (mpu.get_position_embedding_group, {}),
]
for getter, kwargs in GROUPS_TO_WARM_UP:

…lloc OOM

NCCL communicators are created lazily; the per-step collectives' first use lands at the iteration-1 memory peak, where NCCL's internal cudaMalloc can fail with 'Failed to CUDA calloc async N bytes' and kill training on step 1 (see modelscope#6387). Add an opt-in nccl_comm_warmup flag that fires an inert 1-element all_reduce on each parallel group before the loop, forcing the communicators to be created while memory is still free. Default off.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
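The warmup described in the commit message could look roughly like the sketch below. It is not the PR's actual implementation: `warm_up_groups` and `torch_probe` are hypothetical names, and the sketch assumes that a getter for a group that does not exist in the current parallel config either raises `AssertionError` or returns `None`. The collective is passed in as a parameter so the skip logic can be exercised without GPUs.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

GroupGetter = Callable[..., Any]


def torch_probe(group: Any) -> None:
    """Issue one 1-element all_reduce on `group`.

    Requires torch.distributed to be initialized with the NCCL backend.
    """
    import torch
    import torch.distributed as dist

    # Summing zeros leaves the value at zero on every rank, so the call is
    # numerically inert; it is issued only so the communicator gets created.
    probe = torch.zeros(1, device="cuda")
    dist.all_reduce(probe, group=group)
    # Wait for the collective so setup has finished before training starts.
    torch.cuda.synchronize()


def warm_up_groups(
    getters: Iterable[Tuple[GroupGetter, Dict[str, Any]]],
    probe: Callable[[Any], None] = torch_probe,
) -> int:
    """Run `probe` once per existing group; return how many were warmed."""
    warmed = 0
    for getter, kwargs in getters:
        try:
            group = getter(**kwargs)
        except AssertionError:
            # Assumption: the getter asserts when the group is not initialized.
            continue
        if group is None:
            # Assumption: some getters return None instead of asserting.
            continue
        probe(group)
        warmed += 1
    return warmed
```

Under these assumptions, the call site would be `warm_up_groups(GROUPS_TO_WARM_UP)`, placed before the training loop and guarded by the `nccl_comm_warmup` flag. Each rank skips only the groups it does not belong to, so every member of an existing group still takes part in that group's collective.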
1a0d1ec to 83197c5
What

Adds an opt-in `nccl_comm_warmup` flag that eagerly creates the NCCL communicators before the training loop, to avoid an iteration-1 OOM.

Why

NCCL communicators are created lazily on first use. For the per-step collectives that ms-swift/Megatron issue inside the training step (e.g. the dp/cp loss all-reduce and grad-sync coalescing), that first use happens at the iteration-1 memory peak. At that point NCCL's internal `cudaMalloc` for the communicator buffers can fail with an error of the form 'Failed to CUDA calloc async N bytes', so training dies on the very first step on large models and tight-memory configs. This is the error reported in #6387 (Qwen3-Next-80B-A3B, 8×H20), which was only worked around by changing the parallel layout (`--expert_model_parallel_size 8`) rather than fixed in code. We hit the same failure on Qwen3.5-35B-A3B.

Change

When `--nccl_comm_warmup true` is set, right before `on_train_begin` (in `setup_model_training`), fire one numerically inert 1-element `all_reduce` on each parallel group (dp, dp+cp, cp, tp, pp, model, embedding, position-embedding). This forces the communicators to be allocated while memory is still free; groups that don't exist for the current parallel config are skipped. The default is `False`, so behavior is unchanged unless the flag is explicitly enabled.

Notes