Checklist
Bug Description
LoRA training on 128 GPUs
GLM 5.1 Megatron LoRA SFT hangs at first training step with Pipeline Parallel on 128 GPUs
Problem
With pipeline_model_parallel_size set to 4 or 8, training GLM 5.1 (MoE, 78 layers, 256 experts) with Megatron LoRA SFT hangs at the first training step and never produces any loss output. The progress bar stays at Train: 0%| | 0/100. There is no NCCL timeout, no OOM, and no Python traceback, only a silent hang.
This has been reproduced consistently across multiple configurations over 20+ attempts.
Environment
- ms-swift: 4.3.0.dev0
- megatron-core: 0.17.0
- mcore-bridge: 1.4.1 (+ GitHub main for rotary_interleaved fix)
- transformer-engine: 2.15.0
- PyTorch: 2.10.0, CUDA 13.1
- Hardware: 16 nodes × 8 H800 GPUs = 128 GPUs
- Model: ZhipuAI/GLM-5.1 (glm_moe_dsa, 78 layers, 256 experts, DSA attention)
Configurations Tried
| PP | TP | EP | DP | global_batch_size | Result |
| --- | --- | --- | --- | --- | --- |
| 1 | 4 | 8 | 32 | 32 | OOM during LoRA adapter injection (single card ~95GB full) |
| 2 | 4 | 8 | 8 | 32 | OOM during LoRA adapter injection |
| 4 | 4 | 8 | 8 | 32 | Hang at first step (9+ hours, no timeout) |
| 8 | 4 | 8 | 4 | 32 | Hang at first step (testing) |
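For reference, the data-parallel size and the number of microbatches per step in the PP=4 case can be derived from the reported settings. This is a minimal sketch that assumes world size = TP × PP × DP with no context parallelism, and that expert parallelism reuses existing ranks rather than multiplying the world size; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: derive DP and microbatches per step for the PP=4 hang case.
world_size=128        # 16 nodes x 8 GPUs
tp=4                  # --tensor_model_parallel_size
pp=4                  # --pipeline_model_parallel_size
micro_batch_size=1    # --micro_batch_size
global_batch_size=32  # --global_batch_size

# Assumption: world_size = TP x PP x DP (EP does not add ranks).
dp=$(( world_size / (tp * pp) ))

# Each data-parallel replica processes this many microbatches per step.
num_microbatches=$(( global_batch_size / (micro_batch_size * dp) ))

echo "DP=${dp} microbatches_per_step=${num_microbatches}"
# prints: DP=8 microbatches_per_step=4
```

Under these assumptions, each pipeline with 4 stages sees only 4 microbatches per step, so the hang occurs before even the first of them completes.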
Hang Symptoms (PP=4 case, observed for 9+ hours)
- Model loads successfully (78 layers, 46.9B params, 192M trainable LoRA)
- Dataset processes successfully (52k samples, packing to 8192, size=1023)
- Progress bar appears:
Train: 0%| | 0/100
- The following warning appears, then no further progress:
[rankN]:[W ProcessGroupNCCL.cpp:4071] Warning: An unbatched P2P op (send/recv)
was called on this ProcessGroup with size 4. In lazy initialization mode, this
will result in a new 2-rank NCCL communicator to be created.
- GPU metrics: Only 3/16 executors show GPU utilization (~50-60%), the other 13 are idle. All executors have memory allocated (45-65 GiB), meaning processes are alive but stuck.
- No NCCL timeout even with NCCL_TIMEOUT=1800 + NCCL_ASYNC_ERROR_HANDLING=1 set: after 9+ hours, no timeout error was triggered.
Key Observation
The model config shows batch_p2p_comm=True, batch_p2p_sync=True, but the warning says "An unbatched P2P op was called". This mismatch suggests some pipeline P2P operations bypass the batched path and trigger lazy NCCL communicator initialization, which appears to deadlock.
Relevant Config Snippet
megatron sft \
--model GLM-5.1 \
--tuner_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--tensor_model_parallel_size 4 \
--expert_model_parallel_size 8 \
--pipeline_model_parallel_size 4 \
--decoder_last_pipeline_num_layers 18 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--sequence_parallel true \
--micro_batch_size 1 \
--global_batch_size 32 \
--packing true \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--optimizer_cpu_offload false \
--attention_backend flash
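The layer split implied by this snippet can be checked with a little arithmetic. The sketch below assumes that the layers not assigned to the last stage by --decoder_last_pipeline_num_layers are divided evenly across the remaining pipeline stages; that even split is an assumption about the framework's behavior, not something confirmed by the logs.

```shell
#!/usr/bin/env bash
# Sketch: layers per pipeline stage for 78 layers, PP=4, last stage = 18.
num_layers=78
pp=4
last_stage_layers=18   # --decoder_last_pipeline_num_layers

remaining=$(( num_layers - last_stage_layers ))
other_stages=$(( pp - 1 ))

if [ $(( remaining % other_stages )) -eq 0 ]; then
  # Assumption: the remaining layers are spread evenly over the other stages.
  layers_per_other_stage=$(( remaining / other_stages ))
  echo "stages 0..$(( other_stages - 1 )): ${layers_per_other_stage} layers each, last stage: ${last_stage_layers}"
else
  echo "remaining ${remaining} layers do not divide evenly over ${other_stages} stages" >&2
fi
```

With these numbers the first three stages would hold 20 layers each and the last stage 18, so the split itself looks consistent.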
Debug Environment Variables
NCCL_DEBUG=WARN
TORCH_DISTRIBUTED_DEBUG=DETAIL
NCCL_TIMEOUT=1800
NCCL_ASYNC_ERROR_HANDLING=1
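A more verbose set of variables might help locate where the ranks stall. The following is a sketch, not a confirmed fix: it uses the TORCH_NCCL_-prefixed names that recent PyTorch releases read (NCCL_ASYNC_ERROR_HANDLING is, as far as I know, the older spelling), raises NCCL logging for communicator setup and P2P, and enables the PyTorch NCCL flight recorder. The buffer size of 2000 is an arbitrary illustrative value, and whether any of this changes the hang is untested.

```shell
#!/usr/bin/env bash
# Sketch: extra diagnostics for a silent hang during pipeline P2P setup.

# NCCL-side logging: INFO level, limited to init and point-to-point subsystems.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,P2P

# PyTorch-side: current name of the async error handling switch.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

# PyTorch NCCL flight recorder: keep a ring buffer of recent collectives
# and dump it when the watchdog detects a timeout.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000   # illustrative size
export TORCH_NCCL_DUMP_ON_TIMEOUT=1

# Keep the existing PyTorch distributed debug level from the report.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```

These would need to be exported on every node before launching the training command.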
Questions
- Is GLM 5.1 (MoE + DSA) with pipeline_model_parallel_size > 1 supported? Are there known issues with pipeline parallel for MoE models?
- The "unbatched P2P op" warning appears despite batch_p2p_comm=True in the model config. Is this expected? Could this cause a deadlock during lazy NCCL communicator initialization?
- Is there a recommended parallelism configuration for GLM 5.1 LoRA training on 128 GPUs?
Related Issues
How to Reproduce
NPROC_PER_NODE=${n_gpus_per_node} \
NNODES=${nnodes} \
NODE_RANK=${ARNOLD_ID} \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model ${MODEL_PATH} \
    --dataset ${DATASET_PATH} \
    --tuner_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --merge_lora false \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 4 \
    --decoder_last_pipeline_num_layers 18 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --dsa_indexer_loss_coeff 0.01 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 32 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --optimizer_cpu_offload false \
    --optimizer_offload_fraction 1 \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --train_iters 100 \
    --max_length 8192 \
    --save_steps 100 \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --agent_template glm5_1
Additional Information
No response