
add support for hsdp and configurable nccl timeout #40

Merged
MasterJH5574 merged 1 commit into mlc-ai:main from haok1402:0515-scaling-up
May 16, 2026

Conversation

@haok1402 (Collaborator) commented May 16, 2026

TL;DR: HSDP replicates across the data-parallel (DP) dimension; the other parallelism dimensions still shard.


Small models trained across many devices (to shorten wall-clock time) often fit comfortably within a fraction of the cluster. Full-mesh sharding with FSDP is then unnecessary and adds communication overhead; HSDP avoids this by sharding within a smaller intra-replica group and replicating across the rest.
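The shard-within / replicate-across layout can be sketched with plain rank arithmetic. This is a minimal illustration only: `hsdp_process_groups`, its signature, and the contiguous grouping are assumptions for clarity, not the PR's actual mesh-construction code.

```python
def hsdp_process_groups(world_size: int, shard_size: int):
    """Compute HSDP rank groupings: parameters are sharded within
    contiguous groups of `shard_size` ranks, and each such group holds
    a full model replica. (Hypothetical helper for illustration.)"""
    assert world_size % shard_size == 0, "shard_size must divide world_size"
    num_replicas = world_size // shard_size
    # Ranks that together shard one full copy of the parameters.
    shard_groups = [
        list(range(r * shard_size, (r + 1) * shard_size))
        for r in range(num_replicas)
    ]
    # Ranks holding the same parameter shard, replicated across replicas;
    # gradients are all-reduced across these groups.
    replicate_groups = [
        list(range(i, world_size, shard_size)) for i in range(shard_size)
    ]
    return shard_groups, replicate_groups

# With 8 ranks and shard groups of 4, HSDP keeps two full replicas:
shards, reps = hsdp_process_groups(8, 4)
```

In PyTorch this 2-D layout is typically built with the public `torch.distributed.device_mesh.init_device_mesh` API (e.g. a mesh of shape `(num_replicas, shard_size)` with named dimensions) rather than hand-managed process groups.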

In fact, on Qwen3-30B-A3B at 8×8 H100 with PP=4, CP=1, EP=8, MBS=1, GBS=1024, switching from FSDP to HSDP raised steady-state throughput from ~293K to ~392K tokens/sec (a 1.34× speedup), at the cost of a ~5.6 GB increase in per-device peak memory (48.3 → 53.9 GB).

8n8g-h100/pp4-cp1-ep8-mbs1-gbs1024-seq4096-bf16_fsdp

2026-05-15 09:18:32 | INFO | step 00000021/00000025 | step-time 14.262 sec | cross-entropy-loss 2.5706 | load-balance-loss 1.737396 | learning-rate 1.000000e-06 | gradient-norm 1077.3718 | tokens-per-second 294,083 | peak-gpu-memory 48.33 GB
2026-05-15 09:18:47 | INFO | step 00000022/00000025 | step-time 14.630 sec | cross-entropy-loss 2.5880 | load-balance-loss 1.749288 | learning-rate 1.000000e-06 | gradient-norm 795.1832 | tokens-per-second 286,688 | peak-gpu-memory 48.30 GB
2026-05-15 09:19:02 | INFO | step 00000023/00000025 | step-time 14.203 sec | cross-entropy-loss 2.6752 | load-balance-loss 1.768129 | learning-rate 1.000000e-06 | gradient-norm 889.8781 | tokens-per-second 295,311 | peak-gpu-memory 48.22 GB
2026-05-15 09:19:16 | INFO | step 00000024/00000025 | step-time 14.189 sec | cross-entropy-loss 2.5337 | load-balance-loss 1.756927 | learning-rate 1.000000e-06 | gradient-norm 754.6648 | tokens-per-second 295,598 | peak-gpu-memory 48.33 GB
2026-05-15 09:19:30 | INFO | step 00000025/00000025 | step-time 13.242 sec | cross-entropy-loss 2.4828 | load-balance-loss 1.787435 | learning-rate 1.000000e-06 | gradient-norm 670.3105 | tokens-per-second 316,738 | peak-gpu-memory 48.19 GB

8n8g-h100/pp4-cp1-ep8-mbs1-gbs1024-seq4096-bf16_hsdp

2026-05-15 09:44:09 | INFO | step 00000021/00000025 | step-time 10.501 sec | cross-entropy-loss 2.6272 | load-balance-loss 1.753679 | learning-rate 1.000000e-06 | gradient-norm 1209.2335 | tokens-per-second 399,410 | peak-gpu-memory 53.99 GB
2026-05-15 09:44:20 | INFO | step 00000022/00000025 | step-time 10.544 sec | cross-entropy-loss 2.5974 | load-balance-loss 1.759560 | learning-rate 1.000000e-06 | gradient-norm 1030.8104 | tokens-per-second 397,806 | peak-gpu-memory 53.86 GB
2026-05-15 09:44:31 | INFO | step 00000023/00000025 | step-time 10.733 sec | cross-entropy-loss 2.5971 | load-balance-loss 1.755857 | learning-rate 1.000000e-06 | gradient-norm 1003.5530 | tokens-per-second 390,794 | peak-gpu-memory 53.98 GB
2026-05-15 09:44:41 | INFO | step 00000024/00000025 | step-time 10.563 sec | cross-entropy-loss 2.5107 | load-balance-loss 1.793943 | learning-rate 1.000000e-06 | gradient-norm 1723.6670 | tokens-per-second 397,061 | peak-gpu-memory 53.98 GB
2026-05-15 09:44:52 | INFO | step 00000025/00000025 | step-time 10.730 sec | cross-entropy-loss 2.6037 | load-balance-loss 1.739350 | learning-rate 1.000000e-06 | gradient-norm 1366.6276 | tokens-per-second 390,884 | peak-gpu-memory 53.96 GB


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces configurable NCCL timeouts and support for the Hybrid Sharded Data Parallel (HSDP) sharding strategy. It adds nccl_timeout_seconds and sharding_strategy to the distributed configuration and updates the FSDP application logic to handle HSDP mesh construction. Feedback includes a recommendation to retain the default heartbeat timeout in environment setup to prevent regressions in alternative initialization paths and a suggestion to use public DeviceMesh APIs instead of private internal methods for better maintainability.
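A configurable collective timeout of the shape the review describes can be sketched as follows. This is a minimal sketch: `resolve_nccl_timeout`, the environment-variable name, and the default value are illustrative assumptions, not the PR's actual code.

```python
import os
from datetime import timedelta

# Assumed fallback; pick a default appropriate for your cluster.
DEFAULT_NCCL_TIMEOUT_SECONDS = 600

def resolve_nccl_timeout(config_seconds=None) -> timedelta:
    """Resolve the collective timeout to pass to init_process_group.
    Precedence: explicit config value, then an environment override,
    then the default. (Names here are illustrative, not the PR's.)"""
    if config_seconds is not None:
        seconds = config_seconds
    else:
        seconds = int(os.environ.get("NCCL_TIMEOUT_SECONDS",
                                     DEFAULT_NCCL_TIMEOUT_SECONDS))
    return timedelta(seconds=seconds)
```

The resulting `timedelta` would then be forwarded to `torch.distributed.init_process_group(backend="nccl", timeout=...)`, which raises a collective error (or aborts, depending on watchdog settings) once an operation exceeds the timeout.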

Comment thread pithtrain/modules/shutdown.py
Comment thread pithtrain/modules/training.py
@MasterJH5574 MasterJH5574 merged commit 093a9e4 into mlc-ai:main May 16, 2026
1 check passed