add multinode train support #299
Conversation
Signed-off-by: CaranLic <740821011@qq.com>
This pull request has merge conflicts that must be resolved before it can be merged.
fynnsu left a comment
Looks good, thanks for working on this!
I added a couple of comments below. Please let me know if you have any questions about anything. Also, have you had a chance to test this yet (and if so, on what kind of setup)?
# Conflicts:
#	scripts/gen_and_train.py
Signed-off-by: CaranLic <740821011@qq.com>
Tests have been added in the PR. Please take a look and let me know if there are any other comments @fynnsu
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#	scripts/gen_and_train.py
#	src/speculators/train/trainer.py
Signed-off-by: CaranLic <740821011@qq.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: CaranLic <740821011@qq.com>
# Conflicts:
#	scripts/train.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use global rank instead of local_rank when creating MultipackDistributedBatchSamplerV2 for data sharding and when gating tqdm progress bars. In multi-node setups, local_rank repeats across nodes, causing data duplication and duplicate progress bars.

Changes:
- scripts/train.py: setup_dataloader() now accepts and passes global rank to the batch sampler
- src/speculators/train/trainer.py: use global rank for tqdm guards
- src/speculators/train/utils.py: include global rank in distributed setup log message

Part of vllm-project#356. Rebased version of vllm-project#299 by @Liccol.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
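The local_rank vs. global rank distinction at the heart of this fix can be sketched in plain Python. The environment variable names (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) are the standard ones set by torchrun; the `get_ranks` helper and the toy sharding below are illustrative, not the PR's actual code.

```python
import os

def get_ranks(env=os.environ):
    """Read distributed ranks from torchrun-style environment variables.

    LOCAL_RANK restarts at 0 on every node, while RANK is unique across
    the whole job -- so sharding data by LOCAL_RANK duplicates shards
    across nodes.
    """
    local_rank = int(env.get("LOCAL_RANK", 0))
    global_rank = int(env.get("RANK", local_rank))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, global_rank, world_size

# Example: 2 nodes x 2 GPUs each. GPU 0 on node 1 sees LOCAL_RANK=0
# (same as GPU 0 on node 0) but a globally unique RANK=2.
local_rank, global_rank, world_size = get_ranks(
    {"LOCAL_RANK": "0", "RANK": "2", "WORLD_SIZE": "4"}
)

# Shard a 12-sample dataset by the globally unique rank, not the
# per-node one, so no two processes read the same samples.
shard_indices = list(range(global_rank, 12, world_size))  # [2, 6, 10]
```

Passing `global_rank` (instead of `local_rank`) into the batch sampler is exactly what keeps each process on a disjoint shard once the job spans more than one node.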
This PR adds multi-node training support. The main changes are adding multi-node training parameters, correcting rank parameter usage, and updating the README.
Changes
1. Multi-node Training Parameters
2. Rank Parameter Correction
3. README Updates
Testing
Single-node Training
Multi-node Training
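For reference, a typical two-node launch for the multi-node test case above might look like the following. This is a sketch assuming a torchrun-based entry point; the master address, port, and GPU counts are placeholders, not the PR's documented values.

```shell
# Node 0 (hosts the rendezvous master; address/port are placeholders):
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    scripts/train.py

# Node 1: identical command except for the node rank:
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
    --master_addr=10.0.0.1 --master_port=29500 \
    scripts/train.py
```

With this launch, torchrun assigns each process a unique global RANK across both nodes, which is what the batch sampler fix relies on.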