
GLM5.1 MoE + PP training hangs at Train 0/100: batch_p2p_comm=True, but unbatched P2P send/recv actually triggers lazy NCCL communicator init #9451

Description

@chenhuigou

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

LoRA training on 128 GPUs


GLM 5.1 Megatron LoRA SFT hangs at first training step with Pipeline Parallel on 128 GPUs

Problem

Training GLM 5.1 (MoE, 78 layers, 256 experts) with Megatron LoRA SFT hangs at the first training step and never produces any loss output. The progress bar stays at Train: 0%| | 0/100. No NCCL timeout, no OOM, no Python traceback — just a silent hang.

This has been reproduced consistently across multiple configurations over 20+ attempts.

Environment

  • ms-swift: 4.3.0.dev0
  • megatron-core: 0.17.0
  • mcore-bridge: 1.4.1 (+ GitHub main for rotary_interleaved fix)
  • transformer-engine: 2.15.0
  • PyTorch: 2.10.0, CUDA 13.1
  • Hardware: 16 nodes × 8 H800 GPUs = 128 GPUs
  • Model: ZhipuAI/GLM-5.1 (glm_moe_dsa, 78 layers, 256 experts, DSA attention)

Configurations Tried

| PP | TP | EP | DP | global_batch_size | Result |
|----|----|----|----|-------------------|--------|
| 1  | 4  | 8  | 32 | 32 | OOM during LoRA adapter injection (single card ~95 GB full) |
| 2  | 4  | 8  | 8  | 32 | OOM during LoRA adapter injection |
| 4  | 4  | 8  | 8  | 32 | Hang at first step (9+ hours, no timeout) |
| 8  | 4  | 8  | 4  | 32 | Hang at first step (testing) |
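
For reference, the number of microbatches each pipeline has to push through per step follows from these settings. The sketch below is only an illustration: it assumes the usual Megatron layout in which DP = world size / (TP × PP) with no context parallelism, and the helper names `dp_size` and `num_microbatches` are hypothetical, not part of ms-swift or Megatron.

```shell
# Sketch only. Assumes DP = WORLD_SIZE / (TP * PP), i.e. no context parallelism.
# dp_size and num_microbatches are hypothetical helper names.

dp_size() {  # usage: dp_size WORLD_SIZE TP PP
    _model_parallel=$(( $2 * $3 ))
    if [ $(( $1 % _model_parallel )) -ne 0 ]; then
        echo "world size $1 is not divisible by TP*PP=$_model_parallel" >&2
        return 1
    fi
    echo $(( $1 / _model_parallel ))
}

num_microbatches() {  # usage: num_microbatches GLOBAL_BATCH MICRO_BATCH DP
    _per_pass=$(( $2 * $3 ))
    if [ $(( $1 % _per_pass )) -ne 0 ]; then
        echo "global batch $1 is not divisible by micro_batch*DP=$_per_pass" >&2
        return 1
    fi
    echo $(( $1 / _per_pass ))
}

# The hanging PP=4 run: 128 GPUs, TP=4, micro_batch_size=1, global_batch_size=32.
dp=$(dp_size 128 4 4)
echo "PP=4: DP=$dp, microbatches per step=$(num_microbatches 32 1 "$dp")"
```

Under that assumption the PP=4 run has DP=8 and 4 microbatches per step, so every pipeline stage must complete several send/recv exchanges before the first loss can be logged.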

Hang Symptoms (PP=4 case, observed for 9+ hours)

  1. Model loads successfully (78 layers, 46.9B params, 192M trainable LoRA)
  2. Dataset processes successfully (52k samples, packing to 8192, size=1023)
  3. Progress bar appears: Train: 0%| | 0/100
  4. The following warning appears, then no further progress:
[rankN]:[W ProcessGroupNCCL.cpp:4071] Warning: An unbatched P2P op (send/recv) 
was called on this ProcessGroup with size 4. In lazy initialization mode, this 
will result in a new 2-rank NCCL communicator to be created.
  5. GPU metrics: only 3 of 16 executors show GPU utilization (~50-60%); the other 13 are idle. All executors have memory allocated (45-65 GiB), so the processes are alive but stuck.

  6. No NCCL timeout, even with NCCL_TIMEOUT=1800 and NCCL_ASYNC_ERROR_HANDLING=1 set: after 9+ hours, no timeout error was triggered.
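
To correlate the warning with the few executors that still show GPU activity, it may help to list which ranks actually logged it. This is a minimal sketch that assumes each log line carries a `[rankN]` prefix as in the excerpt above; the function name `ranks_with_unbatched_p2p` and the log file name in the usage note are hypothetical.

```shell
# Sketch only. Prints the distinct rank numbers that logged the
# "unbatched P2P op" warning, one per line, in ascending order.
# Assumes the "[rankN]" prefix format shown in the log excerpt.
ranks_with_unbatched_p2p() {  # usage: ranks_with_unbatched_p2p LOGFILE...
    grep -h 'unbatched P2P op' "$@" \
        | sed -n 's/.*\[rank\([0-9][0-9]*\)\].*/\1/p' \
        | sort -n -u
}

# Run directly only when log files are passed as arguments.
if [ "$#" -gt 0 ]; then
    ranks_with_unbatched_p2p "$@"
fi
```

For example, `ranks_with_unbatched_p2p train.log` (with `train.log` standing in for the real log path) shows whether every pipeline stage reached the P2P call or only some of them did.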

Key Observation

The model config shows batch_p2p_comm=True, batch_p2p_sync=True, but the warning says "An unbatched P2P op was called". This mismatch suggests some pipeline P2P operations bypass the batched path and trigger lazy NCCL communicator initialization, which appears to deadlock.

Relevant Config Snippet

megatron sft \
    --model GLM-5.1 \
    --tuner_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 4 \
    --decoder_last_pipeline_num_layers 18 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 32 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --optimizer_cpu_offload false \
    --attention_backend flash
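
The layer split implied by `--decoder_last_pipeline_num_layers 18` can be checked with simple arithmetic. The sketch below assumes that only the last stage is sized explicitly and that the remaining decoder layers are divided evenly across the other stages; `layers_per_other_stage` is a hypothetical helper, not an ms-swift or Megatron function.

```shell
# Sketch only. Assumes the last pipeline stage gets LAST_STAGE_LAYERS and the
# remaining layers are split evenly across the other PP-1 stages.
layers_per_other_stage() {  # usage: layers_per_other_stage TOTAL_LAYERS PP LAST_STAGE_LAYERS
    _rest=$(( $1 - $3 ))
    _stages=$(( $2 - 1 ))
    if [ "$_stages" -lt 1 ] || [ $(( _rest % _stages )) -ne 0 ]; then
        echo "$_rest layers cannot be split evenly across $_stages stages" >&2
        return 1
    fi
    echo $(( _rest / _stages ))
}

# 78 layers, PP=4, 18 layers on the last stage.
echo "layers per non-final stage: $(layers_per_other_stage 78 4 18)"
```

Under that assumption the PP=4 run places 20 layers on each of the first three stages. With PP=8 the same value would leave 60 layers for 7 stages, which does not divide evenly, so that run would need a different split.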

Debug Environment Variables

NCCL_DEBUG=WARN
TORCH_DISTRIBUTED_DEBUG=DETAIL
NCCL_TIMEOUT=1800
NCCL_ASYNC_ERROR_HANDLING=1
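
A more verbose variant of these settings might expose where communicator creation stalls. This is a sketch, not a verified fix: it assumes a recent PyTorch build, where the async error handling switch is read under the `TORCH_NCCL_` prefix, and the log path is a hypothetical placeholder.

```shell
# Sketch only: more verbose logging for the next hang reproduction.
export NCCL_DEBUG=INFO                     # INFO logs communicator setup; WARN does not
export NCCL_DEBUG_SUBSYS=INIT,P2P,NET      # restrict INFO output to init, P2P and network
export NCCL_DEBUG_FILE=/tmp/nccl_debug.%h.%p.log   # hypothetical path; %h = hostname, %p = PID
export TORCH_CPP_LOG_LEVEL=INFO            # surface PyTorch C++ (c10d) log messages
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1   # prefixed name used by recent PyTorch releases
```

With one NCCL log file per process, the last line written by an idle rank indicates how far its 2-rank communicator initialization got.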

Questions

  1. Is GLM 5.1 (MoE + DSA) with pipeline_model_parallel_size > 1 supported? Are there known issues with pipeline parallel for MoE models?
  2. The "unbatched P2P op" warning appears despite batch_p2p_comm=True in the model config — is this expected? Could this cause a deadlock during lazy NCCL communicator initialization?
  3. Is there a recommended parallelism configuration for GLM 5.1 LoRA training on 128 GPUs?

Related Issues

How to Reproduce

NPROC_PER_NODE=${n_gpus_per_node} \
NNODES=${nnodes} \
NODE_RANK=${ARNOLD_ID} \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model ${MODEL_PATH} \
    --dataset ${DATASET_PATH} \
    --tuner_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --merge_lora false \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 4 \
    --decoder_last_pipeline_num_layers 18 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --dsa_indexer_loss_coeff 0.01 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 32 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --optimizer_cpu_offload false \
    --optimizer_offload_fraction 1 \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --train_iters 100 \
    --max_length 8192 \
    --save_steps 100 \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --agent_template glm5_1

Additional Information

No response
