
add support for hsdp and configurable nccl timeout #40

Merged
MasterJH5574 merged 1 commit into mlc-ai:main from haok1402:0515-scaling-up
May 16, 2026

Conversation

@haok1402 (Collaborator) commented May 16, 2026

TL;DR: HSDP replicates across the data-parallel (DP) dimension; the other parallelism dimensions still shard.


Small models trained across many devices (to shorten wall-clock time) often fit comfortably within a fraction of the cluster. Full-mesh sharding with FSDP is then unnecessary and adds communication overhead; HSDP avoids this by sharding within a smaller intra-replica group and replicating across the rest.
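The shard-within / replicate-across layout can be sketched with plain rank arithmetic. This is a minimal illustration only: `hsdp_process_groups`, its signature, and the contiguous grouping are assumptions for clarity, not the PR's actual mesh-construction code.

```python
def hsdp_process_groups(world_size: int, shard_size: int):
    """Compute HSDP rank groupings: parameters are sharded within
    contiguous groups of `shard_size` ranks, and each such group holds
    a full model replica. (Hypothetical helper for illustration.)"""
    assert world_size % shard_size == 0, "shard_size must divide world_size"
    num_replicas = world_size // shard_size
    # Ranks that together shard one full copy of the parameters.
    shard_groups = [
        list(range(r * shard_size, (r + 1) * shard_size))
        for r in range(num_replicas)
    ]
    # Ranks holding the same parameter shard, replicated across replicas;
    # gradients are all-reduced across these groups.
    replicate_groups = [
        list(range(i, world_size, shard_size)) for i in range(shard_size)
    ]
    return shard_groups, replicate_groups

# With 8 ranks and shard groups of 4, HSDP keeps two full replicas:
shards, reps = hsdp_process_groups(8, 4)
```

In PyTorch this 2-D layout is typically built with the public `torch.distributed.device_mesh.init_device_mesh` API (e.g. a mesh of shape `(num_replicas, shard_size)` with named dimensions) rather than hand-managed process groups.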

In fact, on Qwen3-30B-A3B at 8×8 H100 with PP=4, CP=1, EP=8, MBS=1, GBS=1024, switching from FSDP to HSDP raised steady-state throughput from ~293K to ~392K tokens/sec (a 1.34× speedup), at the cost of a ~5.6 GB increase in per-device peak memory (48.3 → 53.9 GB).

8n8g-h100/pp4-cp1-ep8-mbs1-gbs1024-seq4096-bf16_fsdp

2026-05-15 09:18:32 | INFO | step 00000021/00000025 | step-time 14.262 sec | cross-entropy-loss 2.5706 | load-balance-loss 1.737396 | learning-rate 1.000000e-06 | gradient-norm 1077.3718 | tokens-per-second 294,083 | peak-gpu-memory 48.33 GB
2026-05-15 09:18:47 | INFO | step 00000022/00000025 | step-time 14.630 sec | cross-entropy-loss 2.5880 | load-balance-loss 1.749288 | learning-rate 1.000000e-06 | gradient-norm 795.1832 | tokens-per-second 286,688 | peak-gpu-memory 48.30 GB
2026-05-15 09:19:02 | INFO | step 00000023/00000025 | step-time 14.203 sec | cross-entropy-loss 2.6752 | load-balance-loss 1.768129 | learning-rate 1.000000e-06 | gradient-norm 889.8781 | tokens-per-second 295,311 | peak-gpu-memory 48.22 GB
2026-05-15 09:19:16 | INFO | step 00000024/00000025 | step-time 14.189 sec | cross-entropy-loss 2.5337 | load-balance-loss 1.756927 | learning-rate 1.000000e-06 | gradient-norm 754.6648 | tokens-per-second 295,598 | peak-gpu-memory 48.33 GB
2026-05-15 09:19:30 | INFO | step 00000025/00000025 | step-time 13.242 sec | cross-entropy-loss 2.4828 | load-balance-loss 1.787435 | learning-rate 1.000000e-06 | gradient-norm 670.3105 | tokens-per-second 316,738 | peak-gpu-memory 48.19 GB

8n8g-h100/pp4-cp1-ep8-mbs1-gbs1024-seq4096-bf16_hsdp

2026-05-15 09:44:09 | INFO | step 00000021/00000025 | step-time 10.501 sec | cross-entropy-loss 2.6272 | load-balance-loss 1.753679 | learning-rate 1.000000e-06 | gradient-norm 1209.2335 | tokens-per-second 399,410 | peak-gpu-memory 53.99 GB
2026-05-15 09:44:20 | INFO | step 00000022/00000025 | step-time 10.544 sec | cross-entropy-loss 2.5974 | load-balance-loss 1.759560 | learning-rate 1.000000e-06 | gradient-norm 1030.8104 | tokens-per-second 397,806 | peak-gpu-memory 53.86 GB
2026-05-15 09:44:31 | INFO | step 00000023/00000025 | step-time 10.733 sec | cross-entropy-loss 2.5971 | load-balance-loss 1.755857 | learning-rate 1.000000e-06 | gradient-norm 1003.5530 | tokens-per-second 390,794 | peak-gpu-memory 53.98 GB
2026-05-15 09:44:41 | INFO | step 00000024/00000025 | step-time 10.563 sec | cross-entropy-loss 2.5107 | load-balance-loss 1.793943 | learning-rate 1.000000e-06 | gradient-norm 1723.6670 | tokens-per-second 397,061 | peak-gpu-memory 53.98 GB
2026-05-15 09:44:52 | INFO | step 00000025/00000025 | step-time 10.730 sec | cross-entropy-loss 2.6037 | load-balance-loss 1.739350 | learning-rate 1.000000e-06 | gradient-norm 1366.6276 | tokens-per-second 390,884 | peak-gpu-memory 53.96 GB


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces configurable NCCL timeouts and support for the Hybrid Sharded Data Parallel (HSDP) sharding strategy. It adds nccl_timeout_seconds and sharding_strategy to the distributed configuration and updates the FSDP application logic to handle HSDP mesh construction. Feedback includes a recommendation to retain the default heartbeat timeout in environment setup to prevent regressions in alternative initialization paths and a suggestion to use public DeviceMesh APIs instead of private internal methods for better maintainability.
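A configurable collective timeout of the shape the review describes can be sketched as follows. This is a minimal sketch: `resolve_nccl_timeout`, the environment-variable name, and the default value are illustrative assumptions, not the PR's actual code.

```python
import os
from datetime import timedelta

# Assumed fallback; pick a default appropriate for your cluster.
DEFAULT_NCCL_TIMEOUT_SECONDS = 600

def resolve_nccl_timeout(config_seconds=None) -> timedelta:
    """Resolve the collective timeout to pass to init_process_group.
    Precedence: explicit config value, then an environment override,
    then the default. (Names here are illustrative, not the PR's.)"""
    if config_seconds is not None:
        seconds = config_seconds
    else:
        seconds = int(os.environ.get("NCCL_TIMEOUT_SECONDS",
                                     DEFAULT_NCCL_TIMEOUT_SECONDS))
    return timedelta(seconds=seconds)
```

The resulting `timedelta` would then be forwarded to `torch.distributed.init_process_group(backend="nccl", timeout=...)`, which raises a collective error (or aborts, depending on watchdog settings) once an operation exceeds the timeout.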

Comment thread pithtrain/modules/shutdown.py
Comment thread pithtrain/modules/training.py
@MasterJH5574 MasterJH5574 merged commit 093a9e4 into mlc-ai:main May 16, 2026
1 check passed