GLM5.1 MoE + PP training hangs at Train 0/100: batch_p2p_comm=True, but an unbatched P2P send/recv triggers lazy NCCL communicator init #9451

### Checklist

- [x] I have searched existing issues, and this is a new bug report.

### Bug Description

LoRA training on 128 GPUs.

<img width="998" height="572" alt="Image" src="https://github.com/user-attachments/assets/ffa27a11-c24d-466c-984c-165218562e11" />

## GLM 5.1 Megatron LoRA SFT hangs at first training step with Pipeline Parallel on 128 GPUs

### Problem

Training GLM 5.1 (MoE, 78 layers, 256 experts) with Megatron LoRA SFT **hangs at the first training step** and never produces any loss output. The progress bar stays at `Train: 0%| | 0/100`. No NCCL timeout, no OOM, no Python traceback — just a silent hang.

This has been reproduced consistently across multiple configurations over 20+ attempts.

### Environment

- **ms-swift**: 4.3.0.dev0
- **megatron-core**: 0.17.0
- **mcore-bridge**: 1.4.1 (+ GitHub main for rotary_interleaved fix)
- **transformer-engine**: 2.15.0
- **PyTorch**: 2.10.0, CUDA 13.1
- **Hardware**: 16 nodes × 8 H800 GPUs = **128 GPUs**
- **Model**: ZhipuAI/GLM-5.1 (`glm_moe_dsa`, 78 layers, 256 experts, DSA attention)

### Configurations Tried

| PP | TP | EP | DP | global_batch_size | Result |
|----|----|----|-----|-------------------|--------|
| 1 | 4 | 8 | 32 | 32 | **OOM** during LoRA adapter injection (a single GPU fills up at ~95GB) |
| 2 | 4 | 8 | 8 | 32 | **OOM** during LoRA adapter injection |
| 4 | 4 | 8 | 8 | 32 | **Hang** at first step (9+ hours, no timeout) |
| 8 | 4 | 8 | 4 | 32 | **Hang** at first step (test still in progress) |
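
The DP column and the number of micro-batches per step can be cross-checked with a little arithmetic. The sketch below assumes the usual Megatron relationship `world_size = TP × PP × CP × DP` with no context parallelism, and that expert parallelism is carved out of the data-parallel ranks instead of multiplying the world size. `data_parallel_size` and `num_microbatches` are hypothetical helpers written for this report, not ms-swift or Megatron APIs.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return DP, assuming world_size = TP * PP * CP * DP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(f"{world_size} GPUs not divisible by TP*PP*CP={model_parallel}")
    return world_size // model_parallel


def num_microbatches(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Return how many micro-batches each pipeline processes per training step."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(f"global batch {global_batch_size} not divisible by {per_step}")
    return global_batch_size // per_step


if __name__ == "__main__":
    # Values taken from this report: 128 GPUs, TP=4, micro_batch_size=1, global_batch_size=32.
    for pp in (1, 2, 4, 8):
        dp = data_parallel_size(128, tp=4, pp=pp)
        print(f"PP={pp} DP={dp} microbatches={num_microbatches(32, 1, dp)}")
```

Under these assumptions the PP=4 run has DP=8 and 4 micro-batches per step. The PP=2 row would come out as DP=16 instead of the 8 listed, so that row may contain a typo or may have been run on a different number of GPUs.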

### Hang Symptoms (PP=4 case, observed for 9+ hours)

1. Model loads successfully (78 layers, 46.9B params, 192M trainable LoRA)
2. Dataset processes successfully (52k samples, packing to 8192, size=1023)
3. Progress bar appears: `Train: 0%| | 0/100`
4. The following warning appears, then no further progress:
```
[rankN]:[W ProcessGroupNCCL.cpp:4071] Warning: An unbatched P2P op (send/recv) 
was called on this ProcessGroup with size 4. In lazy initialization mode, this 
will result in a new 2-rank NCCL communicator to be created.
```

5. **GPU metrics**: Only 3 of 16 executors show GPU utilization (~50-60%); the other 13 are idle. All executors have memory allocated (45-65 GiB), so the processes are alive but stuck.

6. **No NCCL timeout** even with `NCCL_TIMEOUT=1800` + `NCCL_ASYNC_ERROR_HANDLING=1` set — after 9+ hours, no timeout error was triggered.

### Key Observation

The model config shows `batch_p2p_comm=True, batch_p2p_sync=True`, but the warning says "An **unbatched** P2P op was called". This mismatch suggests some pipeline P2P operations bypass the batched path and trigger lazy NCCL communicator initialization, which appears to deadlock.
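
To narrow down which ranks take the unbatched path, the warning lines can be tallied by rank. This is a minimal sketch that assumes the log lines look like the excerpt above, with `rankN` replaced by an integer rank; the exact log layout is an assumption and `ranks_with_unbatched_p2p` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern modeled on the warning excerpt in this report (assumed layout).
_UNBATCHED_WARNING = re.compile(
    r"\[rank(\d+)\]:\[W[^\]]*ProcessGroupNCCL[^\]]*\]\s*Warning: An unbatched P2P op"
)


def ranks_with_unbatched_p2p(lines: Iterable[str]) -> Counter:
    """Count, per rank, how many times the unbatched P2P warning was logged."""
    counts: Counter = Counter()
    for line in lines:
        match = _UNBATCHED_WARNING.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[rank3]:[W ProcessGroupNCCL.cpp:4071] Warning: An unbatched P2P op (send/recv) ",
        "was called on this ProcessGroup with size 4. In lazy initialization mode, this ",
        "[rank7]:[W ProcessGroupNCCL.cpp:4071] Warning: An unbatched P2P op (send/recv) ",
    ]
    print(sorted(ranks_with_unbatched_p2p(sample)))
```

If only some of the 128 ranks log the warning, the missing ranks may never have reached their first send/recv, which would be consistent with a peer waiting forever during communicator creation. That interpretation is a guess and would need to be confirmed against stack dumps.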

### Relevant Config Snippet

```bash
megatron sft \
    --model GLM-5.1 \
    --tuner_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 4 \
    --decoder_last_pipeline_num_layers 18 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 32 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --optimizer_cpu_offload false \
    --attention_backend flash
```
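
For reference, the layer split implied by `--pipeline_model_parallel_size 4` and `--decoder_last_pipeline_num_layers 18` on a 78-layer model can be worked out as follows. The sketch assumes that the layers not assigned to the last stage are divided evenly among the remaining stages, which is my understanding of the uneven-pipeline option but is not confirmed here. `pipeline_layer_split` is a hypothetical helper.

```python
from typing import List, Optional


def pipeline_layer_split(
    num_layers: int, pp: int, last_stage_layers: Optional[int] = None
) -> List[int]:
    """Return the number of decoder layers per pipeline stage.

    Assumption: layers outside the last stage are split evenly over the
    other pp - 1 stages.
    """
    if last_stage_layers is None:
        if num_layers % pp != 0:
            raise ValueError(f"{num_layers} layers not divisible by PP={pp}")
        return [num_layers // pp] * pp
    if pp < 2:
        raise ValueError("an explicit last stage requires PP >= 2")
    remaining = num_layers - last_stage_layers
    if remaining <= 0 or remaining % (pp - 1) != 0:
        raise ValueError(
            f"{remaining} remaining layers cannot be split evenly over {pp - 1} stages"
        )
    return [remaining // (pp - 1)] * (pp - 1) + [last_stage_layers]


if __name__ == "__main__":
    # Values from this report: 78 layers, PP=4, 18 layers on the last stage.
    print(pipeline_layer_split(78, pp=4, last_stage_layers=18))
```

With the values above the split is 20, 20, 20 and 18 layers. Under the same assumption, keeping 18 layers on the last stage with PP=8 would leave 60 layers for 7 stages, which does not divide evenly; the report does not state which value the PP=8 run used.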

### Debug Environment Variables

```bash
NCCL_DEBUG=WARN
TORCH_DISTRIBUTED_DEBUG=DETAIL
NCCL_TIMEOUT=1800
NCCL_ASYNC_ERROR_HANDLING=1
```

### Questions

1. Is GLM 5.1 (MoE + DSA) with `pipeline_model_parallel_size > 1` supported? Are there known issues with pipeline parallel for MoE models?
2. The "unbatched P2P op" warning appears despite `batch_p2p_comm=True` in the model config — is this expected? Could this cause a deadlock during lazy NCCL communicator initialization?
3. Is there a recommended parallelism configuration for GLM 5.1 LoRA training on 128 GPUs?

### Related Issues

- #6312: Swift Megatron SFT training hangs after thousands of steps on 128 GPUs with MoE + CP + PP
- #6476: Qwen3-235B Megatron pipeline parallel error
- #9191: GLM5 full-parameter SFT OOM


### How to Reproduce

```bash
NPROC_PER_NODE=${n_gpus_per_node} \
NNODES=${nnodes} \
NODE_RANK=${ARNOLD_ID} \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model ${MODEL_PATH} \
    --dataset ${DATASET_PATH} \
    --tuner_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --merge_lora false \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 4 \
    --decoder_last_pipeline_num_layers 18 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --dsa_indexer_loss_coeff 0.01 \
    --sequence_parallel true \
    --micro_batch_size 1 \
    --global_batch_size 32 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --optimizer_cpu_offload false \
    --optimizer_offload_fraction 1 \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --train_iters 100 \
    --max_length 8192 \
    --save_steps 100 \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --attention_backend flash \
    --agent_template glm5_1
```


### Additional Information

_No response_
