
add multinode train support#299

Open
Liccol wants to merge 17 commits into vllm-project:main from Liccol:main

Conversation

Liccol commented Feb 24, 2026

This PR adds multi-node training support. The main changes are: adding multi-node training parameters, correcting rank parameter usage, and improving the README.

Changes

1. Multi-node Training Parameters

  • Added 5 new multi-node training parameters in scripts/gen_and_train.py:
    • nproc_per_node: Number of processes per node
    • nnodes: Number of nodes
    • node_rank: Current node rank
    • master_addr: Master node address
    • master_port: Master node port
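These parameter names map directly onto torchrun's launcher flags. As a rough sketch of how such options might be exposed and forwarded (the exact argparse wiring in scripts/gen_and_train.py may differ; names and defaults here are illustrative assumptions):

```python
import argparse

def build_parser():
    # Hypothetical sketch of the multi-node options; the actual
    # flag names and defaults in scripts/gen_and_train.py may differ.
    p = argparse.ArgumentParser()
    p.add_argument("--nproc-per-node", type=int, default=8,
                   help="Processes (GPUs) launched on each node")
    p.add_argument("--nnodes", type=int, default=1,
                   help="Total number of nodes in the job")
    p.add_argument("--node-rank", type=int, default=0,
                   help="Rank of this node (0 .. nnodes-1)")
    p.add_argument("--master-addr", type=str, default="127.0.0.1",
                   help="Address of the rank-0 (master) node")
    p.add_argument("--master-port", type=int, default=29500,
                   help="Rendezvous port on the master node")
    return p

def torchrun_args(args):
    # Forward the parsed values as torchrun launcher flags.
    return [
        f"--nnodes={args.nnodes}",
        f"--nproc_per_node={args.nproc_per_node}",
        f"--node_rank={args.node_rank}",
        f"--master_addr={args.master_addr}",
        f"--master_port={args.master_port}",
    ]
```

All nodes must agree on --master-addr and --master-port; only --node-rank differs per node, as the torchrun commands in the Testing section below show.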

2. Rank Parameter Correction

  • Corrected several places that used local_rank where the global rank is required; for example, MultipackDistributedBatchSamplerV2 now shards data by global rank so that processes on different nodes receive distinct shards
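The distinction matters because torchrun gives each process both a LOCAL_RANK (unique only within one node) and a global RANK (unique across the whole job). A minimal pure-Python sketch of why sharding by local_rank duplicates data across nodes (the real code reads these values via torch.distributed / the torchrun environment; the helper names here are illustrative):

```python
def global_rank_from_env(env):
    # torchrun exports RANK directly; it is equivalent to
    # node_rank * nproc_per_node + local_rank.
    if "RANK" in env:
        return int(env["RANK"])
    return (int(env["NODE_RANK"]) * int(env["NPROC_PER_NODE"])
            + int(env["LOCAL_RANK"]))

def shard(indices, rank, world_size):
    # Round-robin sharding, as a distributed batch sampler might do it.
    return indices[rank::world_size]

# Two nodes x two GPUs (world_size 4). Sharding by LOCAL_RANK makes
# node 0 and node 1 read identical shards; sharding by global RANK
# gives every process a disjoint shard.
data = list(range(8))
world_size = 4
local_shards = [shard(data, lr, world_size) for lr in (0, 1, 0, 1)]
global_shards = [shard(data, r, world_size) for r in range(world_size)]
```

In the local_rank version, the shards for local ranks (0, 1) repeat on both nodes, so half the data is processed twice and half never; the global-rank version covers all indices exactly once.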

3. README Updates

  • Added detailed documentation for multi-node training parameters in scripts/README.md
  • Added multi-node training startup example commands
  • Added documentation for scheduler-related parameters

Testing

Single-node Training

python scripts/data_generation_offline.py --train-data-path sharegpt --turn-dropout --seq-length 2048 --target-model-path /home/data/weights/Qwen3-30B-A3B/ --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen/sharegpt

python scripts/build_vocab_mapping.py --draft-vocab-size 32000 --target-vocab-size 151936 --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping

torchrun --standalone --nproc_per_node=16 scripts/train.py --run-name qwen3_30b_eagle_chat_2node --logger trackio --lr 3e-05 --total-seq-len 2048 --epochs 10 --verifier-name-or-path /home/data/weights/Qwen3-30B-A3B/ --data-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen --save-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/checkpoints --log-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/logs --d2t-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/d2t.npy --t2d-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/t2d.npy

Multi-node Training

python scripts/data_generation_offline.py --train-data-path sharegpt --turn-dropout --seq-length 2048 --target-model-path /home/data/weights/Qwen3-30B-A3B/ --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen/sharegpt

python scripts/build_vocab_mapping.py --draft-vocab-size 32000 --target-vocab-size 151936 --token-freq-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/token_freq_sharegpt.pt --output-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping

# Node 0
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 --master_addr="192.168.13.111" --master_port=12346 scripts/train.py --run-name qwen3_30b_eagle_chat_2node --logger trackio --lr 3e-05 --total-seq-len 2048 --epochs 10 --verifier-name-or-path /home/data/weights/Qwen3-30B-A3B/ --data-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen --save-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/checkpoints --log-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/logs --d2t-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/d2t.npy --t2d-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/t2d.npy

# Node 1
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 --master_addr="192.168.13.111" --master_port=12346 scripts/train.py --run-name qwen3_30b_eagle_chat_2node --logger trackio --lr 3e-05 --total-seq-len 2048 --epochs 10 --verifier-name-or-path /home/data/weights/Qwen3-30B-A3B/ --data-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/gen --save-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/checkpoints --log-dir /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/logs --d2t-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/d2t.npy --t2d-path /home/data/ckpt/qwen3-30b-eagle-chat-temp07-3w-2node/vocab_mapping/t2d.npy

Signed-off-by: CaranLic <740821011@qq.com>

mergify Bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Liccol.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Feb 24, 2026
Collaborator

fynnsu left a comment


Looks good, thanks for working on this!

I added a couple comments below. Please let me know if you have any questions about anything. Also, have you had a chance to test this yet (and if so, on what kind of setup)?

Comment thread scripts/gen_and_train.py Outdated
Comment thread src/speculators/train/trainer.py Outdated
# Conflicts:
#	scripts/gen_and_train.py
@mergify mergify Bot removed the needs-rebase label Mar 5, 2026
Liccol added 4 commits March 10, 2026 15:06
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: CaranLic <740821011@qq.com>

Liccol commented Mar 16, 2026

Looks good, thanks for working on this!

I added a couple comments below. Please let me know if you have any questions about anything. Also, have you had a chance to test this yet (and if so, on what kind of setup)?

Tests have been added to the PR. Please take a look and let me know if there are any other comments. @fynnsu


mergify Bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Liccol.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

# Conflicts:
#	scripts/gen_and_train.py
#	src/speculators/train/trainer.py
@mergify mergify Bot removed the needs-rebase label Mar 23, 2026
Collaborator

fynnsu left a comment


Hi @Liccol,

I requested some changes below, please take a look.

Comment thread scripts/train.py Outdated
Comment thread scripts/train.py Outdated
Comment thread scripts/README.md Outdated
Comment thread scripts/gen_and_train.py Outdated

mergify Bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Liccol.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 26, 2026
Liccol added 2 commits March 27, 2026 11:13
Signed-off-by: CaranLic <740821011@qq.com>
# Conflicts:
#	scripts/train.py
@mergify mergify Bot removed the needs-rebase label Mar 27, 2026
@Liccol Liccol requested a review from fynnsu March 27, 2026 03:27

Liccol commented Apr 1, 2026

Hi @Liccol,

I requested some changes below, please take a look.

Hi @fynnsu,
I've addressed the review comments; please take another look and let me know whether it can be merged.

YzTongNiar added a commit to YzTongNiar/speculators that referenced this pull request Apr 10, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
YzTongNiar added a commit to YzTongNiar/speculators that referenced this pull request Apr 10, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
YzTongNiar added a commit to YzTongNiar/speculators that referenced this pull request Apr 11, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ianliuy added a commit to ianliuy/speculators that referenced this pull request Apr 15, 2026
Use the global rank instead of local_rank when creating
MultipackDistributedBatchSamplerV2 for data sharding and when
gating tqdm progress bars. In multi-node setups, local_rank
repeats across nodes, causing data duplication and duplicate
progress bars.

Changes:
- scripts/train.py: setup_dataloader() now accepts and passes
  global rank to the batch sampler
- src/speculators/train/trainer.py: use global rank for tqdm guards
- src/speculators/train/utils.py: include global rank in distributed
  setup log message

Part of vllm-project#356. Rebased version of vllm-project#299 by @Liccol.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>