
[Don't Merge] Roadmap and Dev Plan#3

Draft
SamitHuang wants to merge 12 commits into main from dev_vllm

Conversation

@SamitHuang (Owner) commented Mar 5, 2026

Dev Plan

P0:

Results

Qwen2.5-0.5B GRPO training convergence on GSM8K dataset:


Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Add rollout backend client and test qwen2.5-0.5b non-colocate training
Signed-off-by: samithuang <285365963@qq.com>
Reorder weight synchronization support for colocate and non-colocate scenarios in the goal plan.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request outlines and partially implements a significant architectural enhancement to Slime, enabling the use of vLLM as an alternative rollout backend for Group Relative Policy Optimization (GRPO) training. The core objective is to keep the training interface stable while introducing flexibility in the inference backend. This involves creating a clear abstraction layer for backend communication, managing the vLLM server lifecycle within the Ray ecosystem, and implementing a weight synchronization strategy that adapts to different deployment scenarios, such as colocate and non-colocate GPU setups. The initial phase of this plan has been validated, demonstrating the feasibility and stability of the vLLM integration.

Highlights

  • vLLM Integration as Rollout Backend: Introduced vLLM as a first-class rollout backend in Slime, providing an alternative to SGLang for GRPO training workflows.
  • Backend Abstraction Layer: Implemented a backend-agnostic abstraction layer with RolloutBackendClient and canonical request/response types (RolloutBackendRequest, RolloutBackendResponse) to isolate protocol differences between SGLang and vLLM.
  • Managed vLLM Lifecycle: Enabled Slime to manage the vLLM server process lifecycle within Ray, including startup, health checks, and shutdown, mirroring the existing SGLang engine management.
  • Flexible Weight Synchronization: Designed a robust weight synchronization mechanism that automatically adapts to deployment modes: CUDA IPC for colocate mode (zero-copy) and NCCL broadcast for non-colocate mode (direct GPU transfer), ensuring compatibility with vLLM's native APIs.
  • NCCL Conflict Resolution: Addressed NCCL communicator conflicts between vLLM's PyNcclCommunicator and torch.distributed by isolating vLLM's NCCL operations within a dedicated _NcclBridge subprocess.
  • Phased Implementation and Verification: Completed and verified Phase 1 of the integration, successfully running GRPO training with Qwen2.5-0.5B and GSM8K on 4 GPUs in colocate mode using the vLLM backend.
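The backend abstraction named in the highlights above can be sketched as follows. This is a minimal illustration using the class names from this PR (`RolloutBackendClient`, `RolloutBackendRequest`, `RolloutBackendResponse`); the field names and method signatures beyond those class names are assumptions, not the actual slime definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RolloutBackendRequest:
    # Canonical, backend-agnostic request; fields are illustrative.
    prompt: str
    sampling_params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RolloutBackendResponse:
    # Canonical, backend-agnostic response; fields are illustrative.
    text: str
    output_token_ids: List[int] = field(default_factory=list)
    output_token_logprobs: List[float] = field(default_factory=list)


class RolloutBackendClient(ABC):
    # Unified interface each backend client (SGLang, vLLM) would implement,
    # isolating protocol differences from the general rollout logic.
    @abstractmethod
    async def generate(self, request: RolloutBackendRequest) -> RolloutBackendResponse:
        ...

    @abstractmethod
    async def abort(self) -> None:
        ...
```

With this shape, `sglang_rollout.py` can construct whichever concrete client the `--rollout-backend` argument selects and call it through the same interface.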


Changelog
  • convert_qwen2.5_ckpt.sh
    • Added a new script to convert Qwen2.5-0.5B Hugging Face checkpoints to a torch distributed format.
  • docs/en/advanced/rfc-vllm-rollout-backend.md
    • Added a Request for Comments (RFC) document detailing the proposed architecture and phased plan for integrating vLLM as a rollout backend in Slime.
  • goal_plan.md
    • Added a Chinese-language document outlining the phased development plan for vLLM integration, including initial solutions and risks.
  • rfc-vllm-rollout-backend-en.md
    • Added an updated English RFC document for vLLM integration, marking Phase 1 as complete and detailing implementation challenges and solutions.
  • rfc-vllm-rollout-backend.md
    • Added an updated Chinese RFC document for vLLM integration, marking Phase 1 as complete and detailing implementation challenges and solutions.
  • run-qwen2.5-0.5B-reproducibility-noncolocate.sh
    • Added a shell script for running non-colocated Qwen2.5-0.5B GRPO training with SGLang.
  • run-qwen2.5-0.5B-vllm.sh
    • Added a shell script for validating Qwen2.5-0.5B GRPO training using the vLLM rollout backend in colocate mode.
  • setup_for_vllm.md
    • Added a markdown file providing setup instructions for vLLM, including Docker commands and Python package installations.
  • slime/backends/megatron_utils/actor.py
    • Modified the weight updater selection logic to conditionally use UpdateWeightFromTensor based on the rollout_backend and colocate arguments, specifically for vLLM.
  • slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py
    • Added a _NcclBridge subprocess to isolate vLLM's PyNcclCommunicator, preventing conflicts with torch.distributed NCCL groups.
    • Updated the update_weights method to support vLLM's packed weight synchronization protocol and utilize the _NcclBridge for NCCL operations.
  • slime/backends/vllm_utils/__init__.py
    • Added an __init__.py file to define the vllm_utils package and export VLLMEngine.
  • slime/backends/vllm_utils/vllm_engine.py
    • Added a VLLMEngine Ray actor responsible for launching, managing, and communicating with a vLLM server instance, including weight synchronization methods compatible with SGLangEngine.
  • slime/ray/actor_group.py
    • Updated the GPU allocation logic to restrict CUDA_VISIBLE_DEVICES for trainer processes, ensuring they only see their assigned GPUs and avoid conflicts with rollout GPUs.
  • slime/ray/rollout.py
    • Added _start_vllm_rollout_servers function to handle the startup and management of vLLM rollout servers.
    • Modified start_rollout_servers to dynamically select between SGLang and vLLM server startup based on the rollout_backend argument.
  • slime/rollout/backends/__init__.py
    • Added an __init__.py file to define the backends package and export the new client interfaces.
  • slime/rollout/backends/base_client.py
    • Added an abstract base class RolloutBackendClient and a BackendCapabilities dataclass to define a unified interface for different rollout backends.
  • slime/rollout/backends/sglang_client.py
    • Added SGLangClient to encapsulate SGLang-specific request and response handling, extracted from the general rollout logic.
  • slime/rollout/backends/vllm_client.py
    • Added VLLMClient to handle requests to the vLLM OpenAI-compatible API, including mapping sampling parameters and normalizing responses.
  • slime/rollout/base_types.py
    • Added RolloutBackendRequest and RolloutBackendResponse dataclasses to provide a backend-agnostic contract for rollout operations.
  • slime/rollout/sglang_rollout.py
    • Refactored the generate and abort functions to utilize the new RolloutBackendClient abstraction, allowing for dynamic backend selection.
    • Removed direct SGLang-specific HTTP calls and replaced them with calls to the chosen backend client.
  • slime/utils/arguments.py
    • Added new CLI arguments such as --rollout-backend, --vllm-base-url, --vllm-max-retries, and --vllm-weight-sync-packed for vLLM configuration.
    • Updated argument parsing to set SGLang-related alias defaults when vLLM is selected as the rollout backend.
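The new CLI surface described in the changelog entry above can be sketched with `argparse`. The flag names come from this PR; the defaults, types, and `choices` shown here are assumptions for illustration only.

```python
import argparse

# Hypothetical parser fragment mirroring the flags listed in the changelog;
# defaults and choices are guesses, not slime's actual values.
parser = argparse.ArgumentParser()
parser.add_argument("--rollout-backend", choices=["sglang", "vllm"], default="sglang")
parser.add_argument("--vllm-base-url", type=str, default=None)
parser.add_argument("--vllm-max-retries", type=int, default=3)
parser.add_argument("--vllm-weight-sync-packed", action="store_true")

args = parser.parse_args(["--rollout-backend", "vllm"])
```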
Activity
  • This pull request is marked '[Don't Merge] Roadmap and Dev Plan', indicating that it serves as a planning and tracking document rather than a code contribution intended for immediate merging; accordingly, there is no typical review or merge activity to report.

@SamitHuang SamitHuang marked this pull request as draft March 5, 2026 16:34

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural change by adding vLLM as a new rollout backend, laying a solid foundation for supporting multiple backends with a well-designed abstraction layer and necessary modifications for weight synchronization and server management. However, a security audit identified high-severity concerns regarding the exposure of unauthenticated control endpoints on all network interfaces and medium-severity risks related to the automatic execution of remote code from model repositories. It is recommended to restrict network exposure and disable trust_remote_code by default. Additionally, there are critical inconsistencies with RFCs, especially concerning the colocate mode for vLLM, and opportunities to improve maintainability and portability in scripts and documentation.

Comment on lines +131 to +132

```python
use_tensor_update = self.args.colocate and getattr(self.args, "rollout_backend", "sglang") != "vllm"
update_weight_cls = UpdateWeightFromTensor if use_tensor_update else UpdateWeightFromDistributed
```

critical

The logic here seems to disable the use of UpdateWeightFromTensor for the vLLM backend, even when self.args.colocate is true. This forces the use of UpdateWeightFromDistributed for vLLM in all cases. This contradicts the RFCs (rfc-vllm-rollout-backend-en.md and rfc-vllm-rollout-backend.md), which state that colocate mode for vLLM should use a more efficient CUDA IPC-based weight transfer, typically handled within UpdateWeightFromTensor. This change effectively disables the performance optimization for colocate mode with vLLM.
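The RFC-consistent selection the reviewer describes could look like the sketch below. The two updater classes are stubs standing in for slime's real implementations; only the selection rule is the point here, and this is a reading of the RFC, not the PR's actual code.

```python
# Stub classes standing in for slime's real weight-updater implementations.
class UpdateWeightFromTensor:  # colocate: zero-copy CUDA IPC path
    pass

class UpdateWeightFromDistributed:  # non-colocate: NCCL broadcast path
    pass

def select_weight_updater(args):
    # RFC-consistent rule: colocate mode keeps the tensor/IPC path for
    # every backend, including vLLM; only non-colocate deployments fall
    # back to the distributed NCCL path.
    return UpdateWeightFromTensor if args.colocate else UpdateWeightFromDistributed
```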

```python
"--tensor-parallel-size", str(tp),
"--port", str(self.server_port),
"--host", "0.0.0.0",
"--weight-transfer-config", '{"backend": "nccl"}',
```

security-high

The vLLM server is configured to listen on all network interfaces (0.0.0.0) without any authentication, exposing sensitive control endpoints. This poses a high-severity risk as an attacker could manipulate model weights or disrupt the training process. It is recommended to bind the server to 127.0.0.1 or implement authentication if remote access is needed. Additionally, the weight-transfer-config is hardcoded to nccl, preventing the use of ipc for zero-copy weight transfer in colocate mode, which is a performance optimization outlined in the RFC.

```python
weight_transfer_backend = "ipc" if self.args.colocate else "nccl"
cmd.extend(["--weight-transfer-config", f'{{"backend": "{weight_transfer_backend}"}}'])
```
```shell
docker run -itd --gpus all --ipc=host --shm-size=128g --net=host --privileged=true --restart=always \
```

high

The docker run command uses the --privileged=true flag. This grants the container full access to the host system, which is a significant security risk. It should be avoided unless absolutely necessary. If it is required, please add a comment explaining why this level of privilege is needed.

```python
"--host", "0.0.0.0",
"--weight-transfer-config", '{"backend": "nccl"}',
"--seed", str(seed),
"--trust-remote-code",
```

security-medium

The application enables trust_remote_code=True when launching the vLLM server. This setting allows the execution of arbitrary Python code from the model repository. An attacker who can influence the model path (e.g., via the --hf-checkpoint or --vllm-model CLI arguments) can achieve remote code execution by pointing the application to a malicious repository.

Recommendation: Disable trust_remote_code by default. If it is required for certain models, provide a mechanism for the user to explicitly enable it with a clear security warning.
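The recommended opt-in gating could be sketched as below. The `--trust-remote-code` flag name and the `tokenizer_kwargs` helper are assumptions for illustration, not existing slime arguments or functions.

```python
import argparse
import warnings

# Hypothetical opt-in flag: unsafe remote code execution is off unless
# the user explicitly requests it.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust-remote-code",
    action="store_true",
    default=False,
    help="Allow executing Python code shipped inside the model repository. "
         "Only enable for checkpoints from trusted sources.",
)

def tokenizer_kwargs(args):
    # Surface a loud warning whenever the unsafe mode is opted into.
    if args.trust_remote_code:
        warnings.warn("trust_remote_code enabled: the checkpoint repo can run arbitrary code.")
    return {"trust_remote_code": args.trust_remote_code}
```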

```python
if not output_token_ids and text:
    logger.warning("vLLM response missing token_ids, falling back to tokenizer")
    from slime.utils.processing_utils import load_tokenizer
    tokenizer = load_tokenizer(self.args.hf_checkpoint, trust_remote_code=True)
```

security-medium

The application enables trust_remote_code=True when loading the tokenizer in the VLLMClient. This allows the execution of arbitrary Python code from the model repository. An attacker who can influence the model path can achieve remote code execution by pointing the application to a malicious repository.

Recommendation: Disable trust_remote_code by default and only enable it if explicitly requested by the user.

```
@@ -0,0 +1,360 @@
# RFC: Add vLLM as a Rollout Backend in Slime
```

medium

This pull request introduces three separate RFC documents (docs/en/advanced/rfc-vllm-rollout-backend.md, rfc-vllm-rollout-backend-en.md, rfc-vllm-rollout-backend.md) covering the same feature. This can lead to confusion and make maintenance difficult. It's recommended to consolidate these into a single, canonical RFC and remove the outdated or redundant versions. For instance, this document appears to be an older draft compared to rfc-vllm-rollout-backend-en.md.

Comment on lines +23 to +28

```shell
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
)

ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
```

medium

The script contains hardcoded absolute paths (e.g., /root/Qwen2.5-0.5B-Instruct/, /root/gsm8k/train.parquet). This reduces portability. Consider parameterizing these paths using environment variables or script arguments to make the script more reusable across different environments.

Comment on lines +25 to +31

```shell
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
)

# num-rollout:100
ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
```

medium

This script contains hardcoded absolute paths for model checkpoints and datasets (e.g., /root/Qwen2.5-0.5B-Instruct/, /root/gsm8k/train.parquet). This is not portable. It would be better to define these paths as variables at the top of the script or pass them as arguments, which would make the script easier to adapt for different setups.

Comment on lines +67 to +68

```python
env["NCCL_P2P_DISABLE"] = "1"
env.setdefault("NCCL_IB_DISABLE", "1")
```

medium

These NCCL environment variables (NCCL_P2P_DISABLE, NCCL_IB_DISABLE) are set without any explanation. To improve maintainability, it would be beneficial to add comments explaining why these settings are necessary (e.g., for specific hardware configurations or to work around known issues).

```python
# Disable P2P and InfiniBand for compatibility with certain environments
env["NCCL_P2P_DISABLE"] = "1"
env.setdefault("NCCL_IB_DISABLE", "1")
```

```python
output_token_ids = choice.get("token_ids") or []
logprobs_obj = choice.get("logprobs") or {}
raw_logprobs = logprobs_obj.get("token_logprobs") or []
output_token_logprobs = [float(lp) if lp is not None else 0.0 for lp in raw_logprobs]
```

medium

When a logprob value is None, it's being replaced with 0.0. While this is a reasonable default, a log probability of 0.0 corresponds to a probability of 1.0, which might be a valid value. This could lead to ambiguity. Consider using a very small number (e.g., -1e9) or float('-inf') to represent a missing logprob, or add a comment to clarify this behavior and its implications for downstream calculations like KL divergence.
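The reviewer's suggestion can be sketched with a sentinel value. The `MISSING_LOGPROB` constant and `normalize_logprobs` helper are hypothetical names for illustration, not code from this PR.

```python
import math

# Use -inf (an impossible event) as an unambiguous "missing" marker
# instead of 0.0, which is a legal logprob (probability 1.0) and would
# silently skew downstream KL / importance-weight calculations.
MISSING_LOGPROB = float("-inf")

def normalize_logprobs(raw_logprobs):
    # Coerce backend logprobs to floats, marking absent entries so
    # callers can detect and mask them.
    return [float(lp) if lp is not None else MISSING_LOGPROB for lp in raw_logprobs]
```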

* Draft router design

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Add vllm router

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Add router to script

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix gpu memory utilization

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix output token ids

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Add more nccl flag

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix bug

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

---------

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>