
add multinode train support#299

Open
Liccol wants to merge 17 commits into vllm-project:main from Liccol:main

Conversation

Liccol commented Feb 24, 2026

This PR adds multi-node training support. The main changes are: adding multi-node training parameters, correcting rank parameter usage, and improving the README.

Changes

1. Multi-node Training Parameters

  • Added 5 new multi-node training parameters in scripts/gen_and_train.py:
    • nproc_per_node: Number of processes per node
    • nnodes: Number of nodes
    • node_rank: Current node rank
    • master_addr: Master node address
    • master_port: Master node port
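These parameter names map directly onto torchrun's launcher flags. As a rough sketch of how such options might be exposed and forwarded (the exact argparse wiring in scripts/gen_and_train.py may differ; names and defaults here are illustrative assumptions):

```python
import argparse

def build_parser():
    # Hypothetical sketch of the multi-node options; the actual
    # flag names and defaults in scripts/gen_and_train.py may differ.
    p = argparse.ArgumentParser()
    p.add_argument("--nproc-per-node", type=int, default=8,
                   help="Processes (GPUs) launched on each node")
    p.add_argument("--nnodes", type=int, default=1,
                   help="Total number of nodes in the job")
    p.add_argument("--node-rank", type=int, default=0,
                   help="Rank of this node (0 .. nnodes-1)")
    p.add_argument("--master-addr", type=str, default="127.0.0.1",
                   help="Address of the rank-0 (master) node")
    p.add_argument("--master-port", type=int, default=29500,
                   help="Rendezvous port on the master node")
    return p

def torchrun_args(args):
    # Forward the parsed values as torchrun launcher flags.
    return [
        f"--nnodes={args.nnodes}",
        f"--nproc_per_node={args.nproc_per_node}",
        f"--node_rank={args.node_rank}",
        f"--master_addr={args.master_addr}",
        f"--master_port={args.master_port}",
    ]
```

All nodes must agree on --master-addr and --master-port; only --node-rank differs per node, as the torchrun commands in the Testing section below show.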

2. Rank Parameter Correction

  • Corrected several places that used local_rank where the global rank is required; for example, MultipackDistributedBatchSamplerV2 now shards data by global rank so that processes on different nodes receive distinct shards
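The distinction matters because torchrun gives each process both a LOCAL_RANK (unique only within one node) and a global RANK (unique across the whole job). A minimal pure-Python sketch of why sharding by local_rank duplicates data across nodes (the real code reads these values via torch.distributed / the torchrun environment; the helper names here are illustrative):

```python
def global_rank_from_env(env):
    # torchrun exports RANK directly; it is equivalent to
    # node_rank * nproc_per_node + local_rank.
    if "RANK" in env:
        return int(env["RANK"])
    return (int(env["NODE_RANK"]) * int(env["NPROC_PER_NODE"])
            + int(env["LOCAL_RANK"]))

def shard(indices, rank, world_size):
    # Round-robin sharding, as a distributed batch sampler might do it.
    return indices[rank::world_size]

# Two nodes x two GPUs (world_size 4). Sharding by LOCAL_RANK makes
# node 0 and node 1 read identical shards; sharding by global RANK
# gives every process a disjoint shard.
data = list(range(8))
world_size = 4
local_shards = [shard(data, lr, world_size) for lr in (0, 1, 0, 1)]
global_shards = [shard(data, r, world_size) for r in range(world_size)]
```

In the local_rank version, the shards for local ranks (0, 1) repeat on both nodes, so half the data is processed twice and half never; the global-rank version covers all indices exactly once.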

3. README Updates

  • Added detailed documentation for multi-node training parameters in scripts/README.md
  • Added multi-node training startup example commands
  • Added documentation for scheduler-related parameters

Testing

Single-node Training

python scripts/data_generation_offline.py --train-data-path sharegpt --turn-dropout --seq-length 2048 --target-model-path /home/data/weights/Qwen3-30B-A3B/ --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen/sharegpt

python scripts/build_vocab_mapping.py --draft-vocab-size 32000 --target-vocab-size 151936 --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping

torchrun --standalone --nproc_per_node=16 scripts/train.py --run-name qwen3_30b_eagle_chat_2node --logger trackio --lr 3e-05 --total-seq-len 2048 --epochs 10 --verifier-name-or-path /home/data/weights/Qwen3-30B-A3B/ --data-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen --save-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/checkpoints --log-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/logs --d2t-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/d2t.npy --t2d-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/t2d.npy

Multi-node Training

python scripts/data_generation_offline.py --train-data-path sharegpt --turn-dropout --seq-length 2048 --target-model-path /home/data/weights/Qwen3-30B-A3B/ --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen/sharegpt

python scripts/build_vocab_mapping.py --draft-vocab-size 32000 --target-vocab-size 151936 --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping

# Node 0
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 --master_addr="192.168.13.111" --master_port=12346 scripts/train.py --run-name qwen3_30b_eagle_chat_2node --logger trackio --lr 3e-05 --total-seq-len 2048 --epochs 10 --verifier-name-or-path /home/data/weights/Qwen3-30B-A3B/ --data-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen --save-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/checkpoints --log-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/logs --d2t-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/d2t.npy --t2d-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/t2d.npy

# Node 1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 --master_addr="192.168.13.111" --master_port=12346 scripts/train.py --run-name qwen3_30b_eagle_chat_2node --logger trackio --lr 3e-05 --total-seq-len 2048 --epochs 10 --verifier-name-or-path /home/data/weights/Qwen3-30B-A3B/ --data-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen --save-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/checkpoints --log-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/logs --d2t-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/d2t.npy --t2d-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/t2d.npy

Signed-off-by: CaranLic <740821011@qq.com>

mergify Bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Liccol.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Feb 24, 2026
Collaborator

fynnsu left a comment


Looks good, thanks for working on this!

I added a couple comments below. Please let me know if you have any questions about anything. Also, have you had a chance to test this yet (and if so, on what kind of setup)?

Comment thread scripts/gen_and_train.py Outdated
Comment thread src/speculators/train/trainer.py Outdated
# Conflicts:
#	scripts/gen_and_train.py
@mergify mergify Bot removed the needs-rebase label Mar 5, 2026
Liccol added 4 commits March 10, 2026 15:06
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>

Liccol commented Mar 16, 2026

Looks good, thanks for working on this!

I added a couple comments below. Please let me know if you have any questions about anything. Also, have you had a chance to test this yet (and if so, on what kind of setup)?

Tests have been added to the PR. Please take a look and let me know if there are any other comments. @fynnsu


mergify Bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Liccol.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

# Conflicts:
#	scripts/gen_and_train.py
#	src/speculators/train/trainer.py
@mergify mergify Bot removed the needs-rebase label Mar 23, 2026
Collaborator

fynnsu left a comment


Hi @Liccol,

I requested some changes below, please take a look.

Comment thread scripts/train.py Outdated
Comment thread scripts/train.py Outdated
Comment thread scripts/README.md Outdated
Comment thread scripts/gen_and_train.py Outdated

mergify Bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Liccol.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 26, 2026
Liccol added 2 commits March 27, 2026 11:13
Signed-off-by: CaranLic <740821011@qq.com>
# Conflicts:
#	scripts/train.py
@mergify mergify Bot removed the needs-rebase label Mar 27, 2026
@Liccol Liccol requested a review from fynnsu March 27, 2026 03:27

Liccol commented Apr 1, 2026

Hi @Liccol,

I requested some changes below, please take a look.

Hi @fynnsu,
I've addressed the review comments; please take another look and let me know whether it can be merged.

YzTongNiar added a commit to YzTongNiar/speculators that referenced this pull request Apr 10, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
YzTongNiar added a commit to YzTongNiar/speculators that referenced this pull request Apr 10, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
YzTongNiar added a commit to YzTongNiar/speculators that referenced this pull request Apr 11, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ianliuy added a commit to ianliuy/speculators that referenced this pull request Apr 15, 2026
Use the global rank instead of local_rank when creating
MultipackDistributedBatchSamplerV2 for data sharding and when
gating tqdm progress bars. In multi-node setups, local_rank
repeats across nodes, causing data duplication and duplicate
progress bars.

Changes:
- scripts/train.py: setup_dataloader() now accepts and passes
  global rank to the batch sampler
- src/speculators/train/trainer.py: use global rank for tqdm guards
- src/speculators/train/utils.py: include global rank in distributed
  setup log message

Part of vllm-project#356. Rebased version of vllm-project#299 by @Liccol.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>