Refactor distributed module #41

Open

haok1402 wants to merge 3 commits into mlc-ai:main from haok1402:0516-refactor-distributed

Conversation

@haok1402 (Collaborator)
Rewords the docstrings across the distributed module for clarity, and switches the operation timeout from an int (seconds) to a timedelta with a 15-minute default.
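A minimal sketch of the timeout change, assuming a hypothetical `init_distributed` wrapper; only the timedelta with a 15-minute default comes from the description above:

```python
from datetime import timedelta

import torch.distributed as dist

# 15-minute default per the PR description; the wrapper itself is illustrative.
DEFAULT_TIMEOUT = timedelta(minutes=15)

def init_distributed(timeout: timedelta = DEFAULT_TIMEOUT) -> None:
    # torch.distributed accepts a timedelta for collective timeouts, so the
    # config no longer needs to juggle raw integer seconds.
    dist.init_process_group(backend="nccl", timeout=timeout)
```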

@gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the distributed training configuration and setup, introducing a configurable timeout and a choice of sharding strategy (FSDP vs. HSDP). It also moves the fail-fast exception handling directly into the distributed module and updates the FSDP application logic to support the new sharding strategies. Feedback highlights a potential type mismatch when loading timeouts from JSON, the use of a private PyTorch API (DeviceMesh._concatenate), which may cause future compatibility issues, and a suggestion to improve the readability of the threading exception hook.
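For context on the two strategies, here is a hedged sketch of selecting between them with PyTorch's public ShardingStrategy enum; the `apply_fsdp` helper and its string flag are illustrative, not the PR's actual code:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def apply_fsdp(model: nn.Module, strategy: str = "fsdp") -> FSDP:
    # FULL_SHARD (FSDP) shards parameters, gradients, and optimizer state
    # across every rank; HYBRID_SHARD (HSDP) shards within a node and
    # replicates across nodes, trading memory for less inter-node traffic.
    sharding = (
        ShardingStrategy.HYBRID_SHARD
        if strategy == "hsdp"
        else ShardingStrategy.FULL_SHARD
    )
    return FSDP(model, sharding_strategy=sharding)
```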

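On the review's type-mismatch point: json.load returns a plain number, never a timedelta, so the conversion has to happen explicitly at load time. A minimal sketch, assuming a hypothetical `timeout_seconds` field:

```python
import json
from datetime import timedelta

def load_timeout(path: str) -> timedelta:
    with open(path) as f:
        config = json.load(f)
    # json.load yields an int or float here, so convert explicitly rather
    # than passing the raw number where a timedelta is expected.
    return timedelta(seconds=config["timeout_seconds"])
```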
Comment threads:
- pithtrain/modules/distributed.py
- pithtrain/modules/distributed.py
- pithtrain/modules/training.py
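The fail-fast behavior the review discusses is commonly implemented with threading.excepthook; a readable version might look like the following sketch (the function and hook names are illustrative):

```python
import os
import threading
import traceback

def install_fail_fast_hook() -> None:
    # An uncaught exception in any worker thread aborts the whole process,
    # so one failed rank surfaces immediately instead of hanging a collective.
    def fail_fast(args: threading.ExceptHookArgs) -> None:
        traceback.print_exception(args.exc_type, args.exc_value, args.exc_traceback)
        os._exit(1)

    threading.excepthook = fail_fast
```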
haok1402 force-pushed the 0516-refactor-distributed branch from db1db09 to 2f028ce on May 16, 2026 at 17:07
haok1402 force-pushed the 0516-refactor-distributed branch from 2f028ce to d066629 on May 17, 2026 at 02:21