5 changes: 5 additions & 0 deletions convert_qwen2.5_ckpt.sh
@@ -0,0 +1,5 @@
source scripts/models/qwen2.5-0.5B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct \
--save /root/Qwen2.5-0.5B-Instruct_torch_dist/
Comment on lines +2 to +5

medium

The script contains hardcoded absolute paths for PYTHONPATH, --hf-checkpoint, and --save. This makes the script not portable and difficult to use in different environments. It's recommended to use environment variables or script arguments to specify these paths.

Suggested change

```diff
-PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
-    ${MODEL_ARGS[@]} \
-    --hf-checkpoint /root/Qwen2.5-0.5B-Instruct \
-    --save /root/Qwen2.5-0.5B-Instruct_torch_dist/
+export WORK_DIR=/root
+export PYTHONPATH=${WORK_DIR}/Megatron-LM
+HF_CHECKPOINT_PATH=${WORK_DIR}/Qwen2.5-0.5B-Instruct
+SAVE_PATH=${WORK_DIR}/Qwen2.5-0.5B-Instruct_torch_dist/
+python tools/convert_hf_to_torch_dist.py \
+    ${MODEL_ARGS[@]} \
+    --hf-checkpoint ${HF_CHECKPOINT_PATH} \
+    --save ${SAVE_PATH}
```

360 changes: 360 additions & 0 deletions docs/en/advanced/rfc-vllm-rollout-backend.md

Large diffs are not rendered by default.

462 changes: 462 additions & 0 deletions docs/en/vllm/ROUTER_DESIGN.md

Large diffs are not rendered by default.

33 changes: 33 additions & 0 deletions goal_plan.md
@@ -0,0 +1,33 @@
### Phase 1: Get Qwen2.5-0.5B GRPO 8-GPU synchronous/asynchronous training working (train.py and train_async.py) on the GSM8K dataset, with loss/reward convergence closely matching the SGLang backend and with deterministic computation: repeated runs must produce identical loss curves.

First Design and RFC by 03/06

#### Initial plan:
- Mirroring SGLang, Slime manages the full vLLM lifecycle inside Ray: process launch, weight synchronization, and inference pause/resume
- No router for now: the SGLang Model Gateway supports only SGLang workers, and SlimeRouter is needed only for R3 / radix-tree caching; Qwen2.5-0.5B is non-MoE and uses token-in/token-out
- Single vLLM instance, no router; vLLMClient connects directly to the local vLLM process's port
- First support separate training and rollout GPUs (non-colocate), with weight sync via NCCL broadcast, mirroring SGLang's update_weights_from_distributed (the default)
- Then support and validate colocate, with weight sync via GPU IPC (vLLM update_weights_from_ipc, update_weights_from_tensor), mirroring SGLang's update_weights_from_tensor, to verify reproducibility. **IPC requires vLLM 0.17**

#### Risks:
- Dependency conflicts between the slime/SGLang version requirements and those of vLLM 0.16 (numpy, torch, transformers, etc.)
- The slime codebase is rough and unreliable, with a hard dependency on the preset Docker image
- Compute availability


#### Reference

https://thudm.github.io/slime/advanced/reproducibility.html


### Phase 2: Integrate vllm-project/router to support multiple vLLM instances

- vllm router forked from SGLang Model Gateway

### Phase 3: Multi-node large-scale validation with MoE models; optional: verify advanced features such as MTP Speculative Decoding and FP8 rollout

- Model: Qwen/Qwen3-30B-A3B or GLM4.7
- Parallelism: 16 or 128 GPUs; training uses mixed EP+FSDP, rollout uses EP+DP
- Verify more features:
  - BF16 training, FP8 rollout
  - MTP Speculative Decoding
401 changes: 401 additions & 0 deletions rfc-vllm-rollout-backend-en.md

Large diffs are not rendered by default.

436 changes: 436 additions & 0 deletions rfc-vllm-rollout-backend.md

Large diffs are not rendered by default.

137 changes: 137 additions & 0 deletions run-qwen2.5-0.5B-reproducibility-noncolocate.sh
@@ -0,0 +1,137 @@
#!/bin/bash
# Non-colocate version of run-qwen2.5-0.5B-reproducibility.sh
# 2 GPUs: 1 for training, 1 for SGLang rollout

# clean up leftover processes before rerunning the task
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python

set -ex

export PYTHONUNBUFFERED=1

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/scripts/models/qwen2.5-0.5B.sh"

CKPT_ARGS=(
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
)

ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
Comment on lines +23 to +28

medium

The script contains hardcoded absolute paths (e.g., /root/Qwen2.5-0.5B-Instruct/, /root/gsm8k/train.parquet). This reduces portability. Consider parameterizing these paths using environment variables or script arguments to make the script more reusable across different environments.
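As a sketch of the environment-variable approach this comment suggests (variable names such as WORK_DIR are hypothetical, not part of the original script; each path falls back to the current /root default):

```shell
#!/bin/bash
# Sketch only: WORK_DIR, HF_CHECKPOINT_PATH, etc. are assumed names.
# Each path defaults to the original /root location but can be
# overridden from the calling environment.
WORK_DIR="${WORK_DIR:-/root}"
HF_CHECKPOINT_PATH="${HF_CHECKPOINT_PATH:-${WORK_DIR}/Qwen2.5-0.5B-Instruct}"
REF_LOAD_PATH="${REF_LOAD_PATH:-${WORK_DIR}/Qwen2.5-0.5B-Instruct_torch_dist}"
PROMPT_DATA="${PROMPT_DATA:-${WORK_DIR}/gsm8k/train.parquet}"

CKPT_ARGS=(
    --hf-checkpoint "${HF_CHECKPOINT_PATH}/"
    --ref-load "${REF_LOAD_PATH}/"
)
ROLLOUT_ARGS=(
    --prompt-data "${PROMPT_DATA}"
)
```

Running with `WORK_DIR=/data ./run.sh` would then relocate every path at once without editing the script.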

--input-key messages
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type math
--num-rollout 100
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 1024
--rollout-temperature 1

--global-batch-size 256
)

EVAL_ARGS=(
--eval-interval 20
--eval-prompt-data gsm8k /root/gsm8k/test.parquet
--n-samples-per-eval-prompt 1
--eval-max-response-len 1024
--eval-top-k 1
)

PERF_ARGS=(
--tensor-model-parallel-size 1
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1

--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)

GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--kl-coef 0.00
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)

OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)

WANDB_ARGS=(
--use-wandb
--wandb-host https://wandb.ai/
--wandb-entity samithuang
--wandb-project slime-rl
--wandb-group qwen2.5-0.5B-gsm8k-noncolocate
)

SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
--sglang-mem-fraction-static 0.7

--sglang-enable-deterministic-inference
--sglang-attention-backend flashinfer

--deterministic-mode
)

MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
)

ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats

ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"env_vars": {
"PYTHONPATH": "/root/Megatron-LM",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"NCCL_ALGO": "Ring",
"NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
"CUBLAS_WORKSPACE_CONFIG": ":4096:8"
}
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 1 \
--num-gpus-per-node 2 \
--rollout-num-gpus 1 \
--calculate-per-token-loss \
--use-slime-router \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${SGLANG_ARGS[@]} \
${MISC_ARGS[@]}
143 changes: 143 additions & 0 deletions run-qwen2.5-0.5B-vllm.sh
@@ -0,0 +1,143 @@
#!/bin/bash
# vLLM rollout backend validation script (Phase 1)
# Based on run-qwen2.5-0.5B-reproducibility.sh

# clean up leftover processes before rerunning the task
pkill -9 vllm
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python


set -ex

export PYTHONUNBUFFERED=1

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/scripts/models/qwen2.5-0.5B.sh"

CKPT_ARGS=(
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
)

ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
Comment on lines +25 to +31

medium

This script contains hardcoded absolute paths for model checkpoints and datasets (e.g., /root/Qwen2.5-0.5B-Instruct/, /root/gsm8k/train.parquet). This is not portable. It would be better to define these paths as variables at the top of the script or pass them as arguments, which would make the script easier to adapt for different setups.
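One way to parameterize, per this comment, is via script arguments with the current /root values as defaults (the argument order and names below are assumptions for illustration):

```shell
#!/bin/bash
# Sketch only: positional arguments with the original /root paths as defaults.
# Usage: ./run-qwen2.5-0.5B-vllm.sh [hf_ckpt] [train_data] [eval_data]
HF_CHECKPOINT_PATH="${1:-/root/Qwen2.5-0.5B-Instruct}"
PROMPT_DATA="${2:-/root/gsm8k/train.parquet}"
EVAL_DATA="${3:-/root/gsm8k/test.parquet}"

ROLLOUT_ARGS=(
    --prompt-data "${PROMPT_DATA}"
)
EVAL_ARGS=(
    --eval-prompt-data gsm8k "${EVAL_DATA}"
)
```

Invoked with no arguments the script behaves exactly as today; a different setup passes its own paths on the command line.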

--input-key messages
--label-key label
--apply-chat-template
--rollout-shuffle
--rm-type math
--num-rollout 500
--rollout-batch-size 32
--n-samples-per-prompt 8
--rollout-max-response-len 1024
--rollout-temperature 1

--global-batch-size 256
)

EVAL_ARGS=(
--eval-interval 20
--eval-prompt-data gsm8k /root/gsm8k/test.parquet
--n-samples-per-eval-prompt 1
--eval-max-response-len 1024
--eval-top-k 1
)

PERF_ARGS=(
--tensor-model-parallel-size 1
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1

--use-dynamic-batch-size
--max-tokens-per-gpu 9216
)

GRPO_ARGS=(
--advantage-estimator grpo
--use-kl-loss
--kl-loss-coef 0.00
--kl-loss-type low_var_kl
--kl-coef 0.00
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)

OPTIMIZER_ARGS=(
--optimizer adam
--lr 1e-6
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)

WANDB_ARGS=(
--use-wandb
--wandb-host https://wandb.ai/
--wandb-entity samithuang
--wandb-project slime-rl
--wandb-group qwen2.5-0.5B-gsm8k-vllm
)

VLLM_ARGS=(
--rollout-backend vllm
--rollout-num-gpus-per-engine 1
--sglang-server-concurrency 512
--use-slime-router
--slime-router-middleware-paths slime.router.middleware_hub.radix_tree_middleware.RadixTreeMiddleware
)

MISC_ARGS=(
--attention-dropout 0.0
--hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32
--attention-softmax-in-fp32
--attention-backend flash
--deterministic-mode
)

ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats

ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{
"env_vars": {
"PYTHONPATH": "/root/Megatron-LM",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"NCCL_ALGO": "Ring",
"NCCL_IB_DISABLE": "1",
"NCCL_P2P_DISABLE": "1",
"NCCL_SHM_DISABLE": "1",
"NCCL_NET_GDR_LEVEL": "0",
"NCCL_DEBUG": "INFO",
"NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
"CUBLAS_WORKSPACE_CONFIG": ":4096:8"
}
}' \
-- python3 train.py \
--actor-num-nodes 1 \
--actor-num-gpus-per-node 1 \
--num-gpus-per-node 2 \
--rollout-num-gpus 1 \
--calculate-per-token-loss \
${MODEL_ARGS[@]} \
${CKPT_ARGS[@]} \
${ROLLOUT_ARGS[@]} \
${OPTIMIZER_ARGS[@]} \
${GRPO_ARGS[@]} \
${WANDB_ARGS[@]} \
${PERF_ARGS[@]} \
${EVAL_ARGS[@]} \
${VLLM_ARGS[@]} \
${MISC_ARGS[@]}
21 changes: 21 additions & 0 deletions setup_for_vllm.md
@@ -0,0 +1,21 @@
```
docker pull slimerl/slime:latest
```

```
docker run -itd --gpus all --ipc=host --shm-size=128g --net=host --privileged=true --restart=always \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --name DNAME \
    slimerl/slime:latest /bin/bash
```

high

The docker run command uses the --privileged=true flag. This grants the container full access to the host system, which is a significant security risk; it should be avoided unless absolutely necessary. If it is required, add a comment explaining why this level of privilege is needed.

```
docker exec -it --user root DNAME bash
```
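If elevated access really is needed, a narrower alternative to `--privileged=true` is to grant only specific capabilities. The set below is an assumption (SYS_PTRACE for in-container profilers/debuggers) and should be adjusted to what the workload actually requires:

```shell
#!/bin/bash
# Sketch only: replace --privileged=true with targeted capabilities.
DOCKER_ARGS=(
    --gpus all --ipc=host --shm-size=128g --net=host
    --cap-add=SYS_PTRACE    # assumed need: profilers/debuggers in-container
    --ulimit memlock=-1 --ulimit stack=67108864
    --ulimit nofile=65536:65536
)
# docker run -itd "${DOCKER_ARGS[@]}" --name DNAME slimerl/slime:latest /bin/bash
printf '%s\n' "${DOCKER_ARGS[@]}"
```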

```
pip install vllm==0.16

# for compatibility
pip install numpy==1.26.4
```
3 changes: 2 additions & 1 deletion slime/backends/megatron_utils/actor.py
@@ -128,7 +128,8 @@ def init(
if self.args.vocab_size is None:
self.args.vocab_size = self.tokenizer.vocab_size

update_weight_cls = UpdateWeightFromTensor if self.args.colocate else UpdateWeightFromDistributed
use_tensor_update = self.args.colocate and getattr(self.args, "rollout_backend", "sglang") != "vllm"
update_weight_cls = UpdateWeightFromTensor if use_tensor_update else UpdateWeightFromDistributed
Comment on lines +131 to +132

critical

The logic here seems to disable the use of UpdateWeightFromTensor for the vLLM backend, even when self.args.colocate is true. This forces the use of UpdateWeightFromDistributed for vLLM in all cases. This contradicts the RFCs (rfc-vllm-rollout-backend-en.md and rfc-vllm-rollout-backend.md), which state that colocate mode for vLLM should use a more efficient CUDA IPC-based weight transfer, typically handled within UpdateWeightFromTensor. This change effectively disables the performance optimization for colocate mode with vLLM.

self.weight_updater = update_weight_cls(
self.args,
self.model,