Refactor distributed module #41

Open

haok1402 wants to merge 3 commits into mlc-ai:main from haok1402:0516-refactor-distributed

Conversation

@haok1402 (Collaborator)
Rewords the docstrings across the distributed module for clarity, and switches the operation timeout from an int (seconds) to a timedelta with a 15-minute default.
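A minimal sketch of the timeout change, assuming a hypothetical `init_distributed` wrapper; only the timedelta with a 15-minute default comes from the description above:

```python
from datetime import timedelta

import torch.distributed as dist

# 15-minute default per the PR description; the wrapper itself is illustrative.
DEFAULT_TIMEOUT = timedelta(minutes=15)

def init_distributed(timeout: timedelta = DEFAULT_TIMEOUT) -> None:
    # torch.distributed accepts a timedelta for collective timeouts, so the
    # config no longer needs to juggle raw integer seconds.
    dist.init_process_group(backend="nccl", timeout=timeout)
```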

@gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the distributed training configuration and setup, introducing a configurable timeout and a choice of sharding strategy (FSDP vs. HSDP). It also moves the fail-fast exception handling directly into the distributed module and updates the FSDP application logic to support the new sharding strategies. Feedback highlights a potential type mismatch when loading timeouts from JSON, the use of a private PyTorch API (DeviceMesh._concatenate), which may cause future compatibility issues, and a suggestion to improve the readability of the threading exception hook.
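For context on the two strategies, here is a hedged sketch of selecting between them with PyTorch's public ShardingStrategy enum; the `apply_fsdp` helper and its string flag are illustrative, not the PR's actual code:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def apply_fsdp(model: nn.Module, strategy: str = "fsdp") -> FSDP:
    # FULL_SHARD (FSDP) shards parameters, gradients, and optimizer state
    # across every rank; HYBRID_SHARD (HSDP) shards within a node and
    # replicates across nodes, trading memory for less inter-node traffic.
    sharding = (
        ShardingStrategy.HYBRID_SHARD
        if strategy == "hsdp"
        else ShardingStrategy.FULL_SHARD
    )
    return FSDP(model, sharding_strategy=sharding)
```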

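On the review's type-mismatch point: json.load returns a plain number, never a timedelta, so the conversion has to happen explicitly at load time. A minimal sketch, assuming a hypothetical `timeout_seconds` field:

```python
import json
from datetime import timedelta

def load_timeout(path: str) -> timedelta:
    with open(path) as f:
        config = json.load(f)
    # json.load yields an int or float here, so convert explicitly rather
    # than passing the raw number where a timedelta is expected.
    return timedelta(seconds=config["timeout_seconds"])
```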
Comment threads:
- pithtrain/modules/distributed.py
- pithtrain/modules/distributed.py
- pithtrain/modules/training.py
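The fail-fast behavior the review discusses is commonly implemented with threading.excepthook; a readable version might look like the following sketch (the function and hook names are illustrative):

```python
import os
import threading
import traceback

def install_fail_fast_hook() -> None:
    # An uncaught exception in any worker thread aborts the whole process,
    # so one failed rank surfaces immediately instead of hanging a collective.
    def fail_fast(args: threading.ExceptHookArgs) -> None:
        traceback.print_exception(args.exc_type, args.exc_value, args.exc_traceback)
        os._exit(1)

    threading.excepthook = fail_fast
```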
haok1402 force-pushed the 0516-refactor-distributed branch from db1db09 to 2f028ce on May 16, 2026 at 17:07
haok1402 force-pushed the 0516-refactor-distributed branch from 2f028ce to d066629 on May 17, 2026 at 02:21