
[Don't Merge] Roadmap and Dev Plan#3

Draft
SamitHuang wants to merge 12 commits into main from dev_vllm

Conversation

@SamitHuang (Owner) commented Mar 5, 2026

Dev Plan

P0:

Results

Qwen2.5-0.5B GRPO training convergence on GSM8K dataset:


Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Add rollout backend client and test qwen2.5-0.5b non-colocate training
Signed-off-by: samithuang <285365963@qq.com>
Reorder weight synchronization support for colocate and non-colocate scenarios in the goal plan.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request outlines and partially implements a significant architectural enhancement to Slime, enabling the use of vLLM as an alternative rollout backend for Group Relative Policy Optimization (GRPO) training. The core objective is to keep the training interface stable while introducing flexibility in the inference backend. This involves creating a clear abstraction layer for backend communication, managing the vLLM server lifecycle within the Ray ecosystem, and implementing a weight synchronization strategy that adapts to different deployment scenarios, such as colocate and non-colocate GPU setups. The initial phase of this plan has been validated, demonstrating the feasibility and stability of the vLLM integration.

Highlights

  • vLLM Integration as Rollout Backend: Introduced vLLM as a first-class rollout backend in Slime, providing an alternative to SGLang for GRPO training workflows.
  • Backend Abstraction Layer: Implemented a backend-agnostic abstraction layer with RolloutBackendClient and canonical request/response types (RolloutBackendRequest, RolloutBackendResponse) to isolate protocol differences between SGLang and vLLM.
  • Managed vLLM Lifecycle: Enabled Slime to manage the vLLM server process lifecycle within Ray, including startup, health checks, and shutdown, mirroring the existing SGLang engine management.
  • Flexible Weight Synchronization: Designed a robust weight synchronization mechanism that automatically adapts to deployment modes: CUDA IPC for colocate mode (zero-copy) and NCCL broadcast for non-colocate mode (direct GPU transfer), ensuring compatibility with vLLM's native APIs.
  • NCCL Conflict Resolution: Addressed NCCL communicator conflicts between vLLM's PyNcclCommunicator and torch.distributed by isolating vLLM's NCCL operations within a dedicated _NcclBridge subprocess.
  • Phased Implementation and Verification: Completed and verified Phase 1 of the integration, successfully running GRPO training with Qwen2.5-0.5B and GSM8K on 4 GPUs in colocate mode using the vLLM backend.
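The backend abstraction named in the highlights above can be sketched as follows. This is a minimal illustration using the class names from this PR (`RolloutBackendClient`, `RolloutBackendRequest`, `RolloutBackendResponse`); the field names and method signatures beyond those class names are assumptions, not the actual slime definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RolloutBackendRequest:
    # Canonical, backend-agnostic request; fields are illustrative.
    prompt: str
    sampling_params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RolloutBackendResponse:
    # Canonical, backend-agnostic response; fields are illustrative.
    text: str
    output_token_ids: List[int] = field(default_factory=list)
    output_token_logprobs: List[float] = field(default_factory=list)


class RolloutBackendClient(ABC):
    # Unified interface each backend client (SGLang, vLLM) would implement,
    # isolating protocol differences from the general rollout logic.
    @abstractmethod
    async def generate(self, request: RolloutBackendRequest) -> RolloutBackendResponse:
        ...

    @abstractmethod
    async def abort(self) -> None:
        ...
```

With this shape, `sglang_rollout.py` can construct whichever concrete client the `--rollout-backend` argument selects and call it through the same interface.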


Changelog
  • convert_qwen2.5_ckpt.sh
    • Added a new script to convert Qwen2.5-0.5B Hugging Face checkpoints to a torch distributed format.
  • docs/en/advanced/rfc-vllm-rollout-backend.md
    • Added a Request for Comments (RFC) document detailing the proposed architecture and phased plan for integrating vLLM as a rollout backend in Slime.
  • goal_plan.md
    • Added a Chinese-language document outlining the phased development plan for vLLM integration, including initial solutions and risks.
  • rfc-vllm-rollout-backend-en.md
    • Added an updated English RFC document for vLLM integration, marking Phase 1 as complete and detailing implementation challenges and solutions.
  • rfc-vllm-rollout-backend.md
    • Added an updated Chinese RFC document for vLLM integration, marking Phase 1 as complete and detailing implementation challenges and solutions.
  • run-qwen2.5-0.5B-reproducibility-noncolocate.sh
    • Added a shell script for running non-colocated Qwen2.5-0.5B GRPO training with SGLang.
  • run-qwen2.5-0.5B-vllm.sh
    • Added a shell script for validating Qwen2.5-0.5B GRPO training using the vLLM rollout backend in colocate mode.
  • setup_for_vllm.md
    • Added a markdown file providing setup instructions for vLLM, including Docker commands and Python package installations.
  • slime/backends/megatron_utils/actor.py
    • Modified the weight updater selection logic to conditionally use UpdateWeightFromTensor based on the rollout_backend and colocate arguments, specifically for vLLM.
  • slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py
    • Added a _NcclBridge subprocess to isolate vLLM's PyNcclCommunicator, preventing conflicts with torch.distributed NCCL groups.
    • Updated the update_weights method to support vLLM's packed weight synchronization protocol and utilize the _NcclBridge for NCCL operations.
  • slime/backends/vllm_utils/__init__.py
    • Added an __init__.py file to define the vllm_utils package and export VLLMEngine.
  • slime/backends/vllm_utils/vllm_engine.py
    • Added a VLLMEngine Ray actor responsible for launching, managing, and communicating with a vLLM server instance, including weight synchronization methods compatible with SGLangEngine.
  • slime/ray/actor_group.py
    • Updated the GPU allocation logic to restrict CUDA_VISIBLE_DEVICES for trainer processes, ensuring they only see their assigned GPUs and avoid conflicts with rollout GPUs.
  • slime/ray/rollout.py
    • Added _start_vllm_rollout_servers function to handle the startup and management of vLLM rollout servers.
    • Modified start_rollout_servers to dynamically select between SGLang and vLLM server startup based on the rollout_backend argument.
  • slime/rollout/backends/__init__.py
    • Added an __init__.py file to define the backends package and export the new client interfaces.
  • slime/rollout/backends/base_client.py
    • Added an abstract base class RolloutBackendClient and a BackendCapabilities dataclass to define a unified interface for different rollout backends.
  • slime/rollout/backends/sglang_client.py
    • Added SGLangClient to encapsulate SGLang-specific request and response handling, extracted from the general rollout logic.
  • slime/rollout/backends/vllm_client.py
    • Added VLLMClient to handle requests to the vLLM OpenAI-compatible API, including mapping sampling parameters and normalizing responses.
  • slime/rollout/base_types.py
    • Added RolloutBackendRequest and RolloutBackendResponse dataclasses to provide a backend-agnostic contract for rollout operations.
  • slime/rollout/sglang_rollout.py
    • Refactored the generate and abort functions to utilize the new RolloutBackendClient abstraction, allowing for dynamic backend selection.
    • Removed direct SGLang-specific HTTP calls and replaced them with calls to the chosen backend client.
  • slime/utils/arguments.py
    • Added new CLI arguments such as --rollout-backend, --vllm-base-url, --vllm-max-retries, and --vllm-weight-sync-packed for vLLM configuration.
    • Updated argument parsing to set SGLang-related alias defaults when vLLM is selected as the rollout backend.
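The new CLI surface described in the changelog entry above can be sketched with `argparse`. The flag names come from this PR; the defaults, types, and `choices` shown here are assumptions for illustration only.

```python
import argparse

# Hypothetical parser fragment mirroring the flags listed in the changelog;
# defaults and choices are guesses, not slime's actual values.
parser = argparse.ArgumentParser()
parser.add_argument("--rollout-backend", choices=["sglang", "vllm"], default="sglang")
parser.add_argument("--vllm-base-url", type=str, default=None)
parser.add_argument("--vllm-max-retries", type=int, default=3)
parser.add_argument("--vllm-weight-sync-packed", action="store_true")

args = parser.parse_args(["--rollout-backend", "vllm"])
```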
Activity
  • This pull request is marked '[Don't Merge] Roadmap and Dev Plan', indicating that it serves as a planning and tracking document rather than a code contribution intended for immediate merging; accordingly, there is no typical review or merge activity to report.

@SamitHuang SamitHuang marked this pull request as draft March 5, 2026 16:34

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural change by adding vLLM as a new rollout backend, laying a solid foundation for supporting multiple backends with a well-designed abstraction layer and necessary modifications for weight synchronization and server management. However, a security audit identified high-severity concerns regarding the exposure of unauthenticated control endpoints on all network interfaces and medium-severity risks related to the automatic execution of remote code from model repositories. It is recommended to restrict network exposure and disable trust_remote_code by default. Additionally, there are critical inconsistencies with RFCs, especially concerning the colocate mode for vLLM, and opportunities to improve maintainability and portability in scripts and documentation.

Comment on lines +131 to +132

```python
use_tensor_update = self.args.colocate and getattr(self.args, "rollout_backend", "sglang") != "vllm"
update_weight_cls = UpdateWeightFromTensor if use_tensor_update else UpdateWeightFromDistributed
```

critical

The logic here seems to disable the use of UpdateWeightFromTensor for the vLLM backend, even when self.args.colocate is true. This forces the use of UpdateWeightFromDistributed for vLLM in all cases. This contradicts the RFCs (rfc-vllm-rollout-backend-en.md and rfc-vllm-rollout-backend.md), which state that colocate mode for vLLM should use a more efficient CUDA IPC-based weight transfer, typically handled within UpdateWeightFromTensor. This change effectively disables the performance optimization for colocate mode with vLLM.
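The RFC-consistent selection the reviewer describes could look like the sketch below. The two updater classes are stubs standing in for slime's real implementations; only the selection rule is the point here, and this is a reading of the RFC, not the PR's actual code.

```python
# Stub classes standing in for slime's real weight-updater implementations.
class UpdateWeightFromTensor:  # colocate: zero-copy CUDA IPC path
    pass

class UpdateWeightFromDistributed:  # non-colocate: NCCL broadcast path
    pass

def select_weight_updater(args):
    # RFC-consistent rule: colocate mode keeps the tensor/IPC path for
    # every backend, including vLLM; only non-colocate deployments fall
    # back to the distributed NCCL path.
    return UpdateWeightFromTensor if args.colocate else UpdateWeightFromDistributed
```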

```python
"--tensor-parallel-size", str(tp),
"--port", str(self.server_port),
"--host", "0.0.0.0",
"--weight-transfer-config", '{"backend": "nccl"}',
```

security-high

The vLLM server is configured to listen on all network interfaces (0.0.0.0) without any authentication, exposing sensitive control endpoints. This poses a high-severity risk as an attacker could manipulate model weights or disrupt the training process. It is recommended to bind the server to 127.0.0.1 or implement authentication if remote access is needed. Additionally, the weight-transfer-config is hardcoded to nccl, preventing the use of ipc for zero-copy weight transfer in colocate mode, which is a performance optimization outlined in the RFC.

```python
weight_transfer_backend = "ipc" if self.args.colocate else "nccl"
cmd.extend(["--weight-transfer-config", f'{{"backend": "{weight_transfer_backend}"}}'])
```
```shell
docker run -itd --gpus all --ipc=host --shm-size=128g --net=host --privileged=true --restart=always \
```

high

The docker run command uses the --privileged=true flag. This grants the container full access to the host system, which is a significant security risk. It should be avoided unless absolutely necessary. If it is required, please add a comment explaining why this level of privilege is needed.

```python
"--host", "0.0.0.0",
"--weight-transfer-config", '{"backend": "nccl"}',
"--seed", str(seed),
"--trust-remote-code",
```

security-medium

The application enables trust_remote_code=True when launching the vLLM server. This setting allows the execution of arbitrary Python code from the model repository. An attacker who can influence the model path (e.g., via the --hf-checkpoint or --vllm-model CLI arguments) can achieve remote code execution by pointing the application to a malicious repository.

Recommendation: Disable trust_remote_code by default. If it is required for certain models, provide a mechanism for the user to explicitly enable it with a clear security warning.
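The recommended opt-in gating could be sketched as below. The `--trust-remote-code` flag name and the `tokenizer_kwargs` helper are assumptions for illustration, not existing slime arguments or functions.

```python
import argparse
import warnings

# Hypothetical opt-in flag: unsafe remote code execution is off unless
# the user explicitly requests it.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust-remote-code",
    action="store_true",
    default=False,
    help="Allow executing Python code shipped inside the model repository. "
         "Only enable for checkpoints from trusted sources.",
)

def tokenizer_kwargs(args):
    # Surface a loud warning whenever the unsafe mode is opted into.
    if args.trust_remote_code:
        warnings.warn("trust_remote_code enabled: the checkpoint repo can run arbitrary code.")
    return {"trust_remote_code": args.trust_remote_code}
```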

```python
if not output_token_ids and text:
    logger.warning("vLLM response missing token_ids, falling back to tokenizer")
    from slime.utils.processing_utils import load_tokenizer
    tokenizer = load_tokenizer(self.args.hf_checkpoint, trust_remote_code=True)
```

security-medium

The application enables trust_remote_code=True when loading the tokenizer in the VLLMClient. This allows the execution of arbitrary Python code from the model repository. An attacker who can influence the model path can achieve remote code execution by pointing the application to a malicious repository.

Recommendation: Disable trust_remote_code by default and only enable it if explicitly requested by the user.

```
@@ -0,0 +1,360 @@
# RFC: Add vLLM as a Rollout Backend in Slime
```

medium

This pull request introduces three separate RFC documents (docs/en/advanced/rfc-vllm-rollout-backend.md, rfc-vllm-rollout-backend-en.md, rfc-vllm-rollout-backend.md) covering the same feature. This can lead to confusion and make maintenance difficult. It's recommended to consolidate these into a single, canonical RFC and remove the outdated or redundant versions. For instance, this document appears to be an older draft compared to rfc-vllm-rollout-backend-en.md.

Comment on lines +23 to +28

```shell
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
)

ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
```

medium

The script contains hardcoded absolute paths (e.g., /root/Qwen2.5-0.5B-Instruct/, /root/gsm8k/train.parquet). This reduces portability. Consider parameterizing these paths using environment variables or script arguments to make the script more reusable across different environments.

Comment on lines +25 to +31

```shell
--hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
--ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
)

# num-rollout:100
ROLLOUT_ARGS=(
--prompt-data /root/gsm8k/train.parquet
```

medium

This script contains hardcoded absolute paths for model checkpoints and datasets (e.g., /root/Qwen2.5-0.5B-Instruct/, /root/gsm8k/train.parquet). This is not portable. It would be better to define these paths as variables at the top of the script or pass them as arguments, which would make the script easier to adapt for different setups.

Comment on lines +67 to +68

```python
env["NCCL_P2P_DISABLE"] = "1"
env.setdefault("NCCL_IB_DISABLE", "1")
```

medium

These NCCL environment variables (NCCL_P2P_DISABLE, NCCL_IB_DISABLE) are set without any explanation. To improve maintainability, it would be beneficial to add comments explaining why these settings are necessary (e.g., for specific hardware configurations or to work around known issues).

```python
# Disable P2P and InfiniBand for compatibility with certain environments
env["NCCL_P2P_DISABLE"] = "1"
env.setdefault("NCCL_IB_DISABLE", "1")
```

```python
output_token_ids = choice.get("token_ids") or []
logprobs_obj = choice.get("logprobs") or {}
raw_logprobs = logprobs_obj.get("token_logprobs") or []
output_token_logprobs = [float(lp) if lp is not None else 0.0 for lp in raw_logprobs]
```

medium

When a logprob value is None, it's being replaced with 0.0. While this is a reasonable default, a log probability of 0.0 corresponds to a probability of 1.0, which might be a valid value. This could lead to ambiguity. Consider using a very small number (e.g., -1e9) or float('-inf') to represent a missing logprob, or add a comment to clarify this behavior and its implications for downstream calculations like KL divergence.
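The reviewer's suggestion can be sketched with a sentinel value. The `MISSING_LOGPROB` constant and `normalize_logprobs` helper are hypothetical names for illustration, not code from this PR.

```python
import math

# Use -inf (an impossible event) as an unambiguous "missing" marker
# instead of 0.0, which is a legal logprob (probability 1.0) and would
# silently skew downstream KL / importance-weight calculations.
MISSING_LOGPROB = float("-inf")

def normalize_logprobs(raw_logprobs):
    # Coerce backend logprobs to floats, marking absent entries so
    # callers can detect and mask them.
    return [float(lp) if lp is not None else MISSING_LOGPROB for lp in raw_logprobs]
```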

* Draft router design

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Add vllm router

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Add router to script

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix gpu memory utilization

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix output token ids

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Add more nccl flag

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix bug

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

---------

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>