[Do not merge] Propose router design for vLLM #4

Open
knlnguyen1802 wants to merge 1 commit into SamitHuang:dev_vllm from knlnguyen1802:router_design

Conversation

@knlnguyen1802
Collaborator

RFC: Replace SGLang Backend with vLLM — Router Integration


Summary

Replace the SGLang inference backend behind SlimeRouter with vLLM while keeping the existing router and middleware stack completely unchanged.
This RFC covers only the router layer — what APIs the vLLM backend must expose, how the existing SlimeRouter is reused, and what translation is needed between the two formats.

Key design decision: Reuse vLLM's built-in OpenAI-compatible API server (vllm serve)


1. Target Architecture

 Rollout Workers                    SlimeRouter (NO CHANGE)                vLLM Engines (NEW)
 ──────────────                    ────────────────────────                ──────────────────
                                   ┌──────────────────────┐
 POST /generate ──────────────────▶│ RadixTreeMiddleware   │
                                   │  • prefix cache       │
                                   │  • retry on abort     │
                                   │  • token/logprob cache│
                                   └──────────┬───────────┘
                                              │
                                   ┌──────────▼───────────┐
                                   │ SlimeRouter.proxy()   │         ┌─────────────────────┐
                                   │  • least-connections  │────────▶│ vLLM Translation    │
                                   │    load balancer      │         │ Sidecar (per engine) │
                                   │  • health check loop  │         │                     │
                                   └──────────────────────┘         │ POST /generate      │
                                                                     │   ↓ translate        │
                                                                     │ POST /v1/completions │
                                                                     │   ↓ translate back   │
                                                                     │ → SGLang-format JSON │
                                                                     └─────────┬───────────┘
                                                                               │
                                                                     ┌─────────▼───────────┐
                                                                     │ vLLM Server          │
                                                                     │ (vllm serve)         │
                                                                     │  • /v1/completions   │
                                                                     │  • /health           │
                                                                     │  • /sleep, /wake_up  │
                                                                     │  • /pause, /resume   │
                                                                     │  • /update_weights   │
                                                                     └─────────────────────┘

What stays the same

| Component | Change | Reason |
|---|---|---|
| SlimeRouter (`router.py`) | None | Engine-agnostic HTTP proxy; only reads JSON responses |
| RadixTreeMiddleware (`radix_tree_middleware.py`) | None | Operates on request/response JSON; has no engine-specific code |
| StringRadixTrie (`radix_tree.py`) | None | Pure data structure, no engine coupling |
| Middleware loading (`--slime-router-middleware-paths`) | None | Dynamic import via `load_function()` |

What is new

| Component | Description |
|---|---|
| `vllm_translation_sidecar.py` | Lightweight FastAPI process co-located with each vLLM engine. Receives SGLang-format `/generate` requests, translates to vLLM's `/v1/completions`, translates responses back. Also proxies lifecycle endpoints (`/abort_request`, `/health_generate`, etc.). |
| `vllm_engine.py` | Ray actor that manages the vLLM server process lifecycle (via `vllm serve`), the translation sidecar, weight updates, and registration with the router. |

2. Reusing SlimeRouter — Zero Modification

The SlimeRouter communicates with backends through five interaction points. All are already engine-agnostic:

2.1 Worker Registration

Flow: Engine starts → engine calls POST /add_worker?url=http://{host}:{port} → router adds to pool.

Router state after registration:
  worker_request_counts["http://10.0.0.1:10090"] = 0
  worker_failure_counts["http://10.0.0.1:10090"] = 0

vLLM action: The VLLMEngine Ray actor calls this endpoint after verifying the vLLM server + translation sidecar are healthy. The registered URL points to the sidecar, not the raw vLLM server. No router change needed.
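As a sketch of this step (the helper name and router URL are illustrative assumptions, not names from the codebase), registration reduces to composing the worker URL and POSTing it:

```python
def build_add_worker_url(router_url: str, sidecar_host: str, sidecar_port: int) -> str:
    """Compose the router registration URL. The registered worker URL points
    at the translation sidecar, not the raw vLLM server."""
    worker_url = f"http://{sidecar_host}:{sidecar_port}"
    return f"{router_url}/add_worker?url={worker_url}"

# In practice the VLLMEngine actor would POST this URL (e.g. via httpx)
# only after the vLLM server and sidecar health checks pass.
```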

2.2 Request Proxying

Flow: POST /generate → middleware pipeline → SlimeRouter.proxy() → httpx forwards to backend (sidecar).

The router selects a backend via least-connections (_use_url()), forwards the raw request body as-is, and returns the response as-is. It never inspects or transforms the request/response payload.

vLLM action: The sidecar receives the forwarded request, translates it to /v1/completions, calls the co-located vLLM server, translates the response back to SGLang format, and returns it.
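The least-connections choice can be sketched as a pure function over the `worker_request_counts` map shown in 2.1 (the function name is illustrative; the router's actual implementation is `_use_url()`):

```python
def pick_worker(worker_request_counts: dict) -> str:
    """Least-connections selection: return the worker URL with the fewest
    in-flight requests (ties broken by dict insertion order)."""
    return min(worker_request_counts, key=worker_request_counts.get)
```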

2.3 Health Check

Flow: Background loop calls GET {worker_url}/health every N seconds.

  • 200 → healthy, reset failure count
  • Non-200 or timeout → increment failure count
  • Failures ≥ threshold (default 3) → quarantine worker permanently

vLLM action: The sidecar's /health proxies to vLLM's built-in /health endpoint (returns 200 when ready). Compatible out of the box.
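The failure-counting behavior described above can be sketched as follows (a minimal model of the router's loop; names are illustrative, and `status_code=None` stands in for a timeout):

```python
QUARANTINE_THRESHOLD = 3  # default failure threshold described above

def apply_health_result(failure_counts: dict, url: str, status_code) -> bool:
    """Fold one health-check result into the failure counter; return True
    when the worker should be quarantined."""
    if status_code == 200:
        failure_counts[url] = 0  # healthy: reset
        return False
    # non-200 or timeout: increment and compare against the threshold
    failure_counts[url] = failure_counts.get(url, 0) + 1
    return failure_counts[url] >= QUARANTINE_THRESHOLD
```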

2.4 Worker Listing

Flow: GET /list_workers → returns {"urls": [...]}

Used by the rollout to discover engines for direct abort calls. No engine involvement.

2.5 Retrieve from Text (Radix Tree)

Flow: POST /retrieve_from_text → router looks up the radix tree cache → returns tokens/logprobs.

Fully router-internal. Never reaches the engine.


3. API Contract — What the Translation Sidecar Must Expose

The translation sidecar sits between SlimeRouter and the vLLM server. It receives SGLang-format requests and returns SGLang-format responses.

3.1 POST /generate — Generation

This is the primary endpoint. The sidecar translates between Slime's format and vLLM's /v1/completions.

Incoming Request (from router)

{
  "input_ids": [128000, 2610, 553, 264, 11190, 18328, 13],
  "input_tokens": [128000, 2610, 553, 264, 11190, 18328, 13],
  "sampling_params": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": -1,
    "max_new_tokens": 1024,
    "stop": ["<|endoftext|>"],
    "stop_token_ids": [128001],
    "skip_special_tokens": false,
    "no_stop_trim": true,
    "spaces_between_special_tokens": false
  },
  "return_logprob": true,
  "stream": false
}

Translated Request (to vLLM /v1/completions)

{
  "model": "<model_name>",
  "prompt": [128000, 2610, 553, 264, 11190, 18328, 13],
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": -1,
  "stop": ["<|endoftext|>"],
  "stop_token_ids": [128001],
  "skip_special_tokens": false,
  "include_stop_str_in_output": true,
  "spaces_between_special_tokens": false,
  "logprobs": 1,
  "stream": false,
  "extra_body": {
    "return_token_ids": true
  }
}

Key translations:

  • input_ids → prompt (vLLM accepts list[int] as pre-tokenized prompt)
  • max_new_tokens → max_tokens
  • no_stop_trim: true → include_stop_str_in_output: true
  • return_logprob: true → logprobs: 1 + extra_body.return_token_ids: true

vLLM Response (from /v1/completions)

{
  "id": "cmpl-abc123",
  "choices": [{
    "text": "I'll help you with that. The answer is 42.",
    "logprobs": {
      "token_logprobs": [-0.152, -0.089, -0.203],
      "tokens": ["I", "'ll", " help"]
    },
    "token_ids": [40, 3358, 1520],
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 3,
    "total_tokens": 10
  }
}

Translated Response (returned to router)

{
  "text": "I'll help you with that. The answer is 42.",
  "output_ids": [40, 3358, 1520],
  "meta_info": {
    "output_token_logprobs": [
      [-0.152, 40],
      [-0.089, 3358],
      [-0.203, 1520]
    ],
    "finish_reason": {
      "type": "stop"
    },
    "weight_version": 3,
    "prompt_tokens": 7,
    "cached_tokens": 0
  }
}
Field-by-field contract

| Field | Type | Required | Consumer | Description |
|---|---|---|---|---|
| `text` | `str` | Yes | Rollout, Middleware | Generated text (output only, not including the prompt) |
| `output_ids` | `list[int]` | Yes | Middleware | Generated token IDs. Middleware checks existence as a gate for caching. |
| `meta_info.output_token_logprobs` | `list[[float, int]]` | Yes (if `return_logprob`) | Rollout, Middleware | Each element is `[logprob, token_id]`. Used for RL policy-ratio calculation. |
| `meta_info.finish_reason` | `{"type": str}` | Yes | Rollout, Middleware | Must be `{"type": "stop"}`, `{"type": "length"}`, or `{"type": "abort"}`. Not a plain string. |
| `meta_info.weight_version` | `int` | Yes | Middleware, Rollout | Current model weight version. Tracked by the sidecar (incremented on each weight update). |
| `meta_info.prompt_tokens` | `int` | Nice-to-have | Rollout (stats) | From `usage.prompt_tokens`. |
| `meta_info.cached_tokens` | `int` | Nice-to-have | Rollout (stats) | vLLM doesn't expose this directly; default to 0. |
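The required parts of this contract can be checked mechanically. The sketch below (function name is illustrative) returns a list of violations, so an empty list means the response satisfies the contract:

```python
def validate_generate_response(resp: dict, expect_logprobs: bool) -> list:
    """Check a sidecar /generate response against the required fields above."""
    errors = []
    if not isinstance(resp.get("text"), str):
        errors.append("text must be a str")
    if not isinstance(resp.get("output_ids"), list):
        errors.append("output_ids must be list[int]")
    meta = resp.get("meta_info", {})
    fr = meta.get("finish_reason")
    if not (isinstance(fr, dict) and fr.get("type") in ("stop", "length", "abort")):
        errors.append('finish_reason must be {"type": "stop"|"length"|"abort"}')
    if not isinstance(meta.get("weight_version"), int):
        errors.append("weight_version must be int")
    if expect_logprobs:
        lp = meta.get("output_token_logprobs")
        if not (isinstance(lp, list) and all(len(pair) == 2 for pair in lp)):
            errors.append("output_token_logprobs must be a list of [logprob, token_id]")
    return errors
```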

3.2 GET /health — Health Check

GET /health
→ Sidecar proxies to vLLM's GET /health
→ 200 OK        (engine ready)
→ 503 or timeout (engine not ready / overloaded)

vLLM already provides this endpoint. Passthrough — no translation needed.

3.3 POST /abort_request — Cancel Generation

POST /abort_request
Body: {"abort_all": true}
→ 200 OK

Called directly by the rollout to each engine (bypasses the router). The rollout discovers engine URLs via GET /list_workers, then sends abort to each.

vLLM approach: vLLM uses HTTP connection close for abort (via its @with_cancellation decorator). When a client disconnects, the in-flight request is automatically cancelled.

Implementation options:

  1. Track active connections. The sidecar maintains a set of active httpx connections to the vLLM server. On POST /abort_request, close all of them — triggering vLLM's cancellation.
  2. Use vLLM's /pause endpoint. Call POST /pause to block new requests, then POST /resume after the RL training step completes. This is semantically closer to how Slime uses abort (clearing the decks between training generations).

Note: vLLM has POST /abort_requests only in disaggregated mode. For standard mode, HTTP disconnect is the canonical abort mechanism.

3.4 GET /health_generate — Startup Readiness Probe

GET /health_generate
→ 200 OK        (model loaded, engine ready for generation)

Called by VLLMEngine.init() during startup to block until the engine is fully ready. The sidecar implements this by calling vLLM's GET /health and optionally performing a dummy /v1/completions call with max_tokens=1 to verify end-to-end readiness.
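The optional end-to-end probe can be sketched as a payload builder (the function name and the placeholder token ID are assumptions; a real implementation would use the model's BOS token):

```python
def readiness_probe_payload(model_name: str, probe_token_id: int = 0) -> dict:
    """Body for a single-token, greedy /v1/completions call used only to
    verify that the engine can actually generate."""
    return {
        "model": model_name,
        "prompt": [probe_token_id],
        "max_tokens": 1,
        "temperature": 0.0,
    }
```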

3.5 Sampling Params Translation

The request uses SGLang-format parameter names. The sidecar translates to vLLM's /v1/completions format:

| SGLang field (in request) | vLLM `/v1/completions` field | Notes |
|---|---|---|
| `input_ids` | `prompt` | Direct — vLLM accepts `list[int]` as pre-tokenized prompt |
| `temperature` | `temperature` | Direct |
| `top_p` | `top_p` | Direct |
| `top_k` | `top_k` | Both use `-1` for disabled |
| `max_new_tokens` | `max_tokens` | Name change |
| `stop` | `stop` | Direct (list of strings) |
| `stop_token_ids` | `stop_token_ids` | Direct |
| `skip_special_tokens` | `skip_special_tokens` | Direct |
| `no_stop_trim` | `include_stop_str_in_output` | Same semantics, different name |
| `spaces_between_special_tokens` | `spaces_between_special_tokens` | Direct |
| `return_logprob` | `logprobs` (set to 1) | Also add `extra_body.return_token_ids = true` |
| `sampling_seed` | `seed` | Optional |
| — | `model` | Must be set to the model name served by vLLM |
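The table above can be collapsed into one translation function. This is a sketch under the mappings stated in this RFC (the function name and the fallback `max_tokens` value are assumptions):

```python
def translate_generate_request(req: dict, model_name: str) -> dict:
    """Map an SGLang-format /generate body to a vLLM /v1/completions body."""
    sp = req.get("sampling_params", {})
    out = {
        "model": model_name,
        "prompt": req["input_ids"],                # vLLM accepts list[int]
        "max_tokens": sp.get("max_new_tokens", 16),  # assumed fallback
        "stream": req.get("stream", False),
    }
    # Directly-mapped sampling params (same name on both sides)
    for key in ("temperature", "top_p", "top_k", "stop", "stop_token_ids",
                "skip_special_tokens", "spaces_between_special_tokens"):
        if key in sp:
            out[key] = sp[key]
    if sp.get("no_stop_trim"):
        out["include_stop_str_in_output"] = True
    if "sampling_seed" in sp:
        out["seed"] = sp["sampling_seed"]
    if req.get("return_logprob"):
        out["logprobs"] = 1
        out["extra_body"] = {"return_token_ids": True}
    return out
```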

3.6 Response Translation Pseudocode

def translate_vllm_response(vllm_resp: dict, weight_version: int) -> dict:
    """Translate vLLM /v1/completions response to SGLang format."""
    choice = vllm_resp["choices"][0]
    usage = vllm_resp.get("usage", {})

    # Build output_token_logprobs: zip logprobs with token IDs
    output_token_logprobs = None
    if choice.get("logprobs") and choice.get("token_ids"):
        output_token_logprobs = [
            [logprob, token_id]
            for logprob, token_id in zip(
                choice["logprobs"]["token_logprobs"],
                choice["token_ids"]
            )
        ]

    # Translate finish_reason: plain string → {"type": str}
    raw_reason = choice.get("finish_reason")
    finish_reason = {"type": raw_reason if raw_reason else "abort"}

    return {
        "text": choice["text"],
        "output_ids": choice.get("token_ids", []),
        "meta_info": {
            "output_token_logprobs": output_token_logprobs,
            "finish_reason": finish_reason,
            "weight_version": weight_version,
            "prompt_tokens": usage.get("prompt_tokens", 0),
            "cached_tokens": 0,
        }
    }

3.7 finish_reason Translation Table

| vLLM returns | Translate to | Notes |
|---|---|---|
| `"stop"` | `{"type": "stop"}` | Normal completion |
| `"length"` | `{"type": "length"}` | Hit `max_tokens` |
| `None` (aborted/incomplete) | `{"type": "abort"}` | Triggers middleware retry logic (sleep 30 s, up to 5 retries) |

4. Server Launch Configuration

The VLLMEngine Ray actor should launch vLLM as follows:

# Environment
export VLLM_SERVER_DEV_MODE=1

# Launch vLLM server
vllm serve <model_path> \
    --host 0.0.0.0 \
    --port <engine_port> \
    --tensor-parallel-size <tp_size> \
    --enable-sleep-mode \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests

The translation sidecar runs on a separate port (<sidecar_port>), and its URL is the one registered with the router via POST /add_worker?url=http://{host}:{sidecar_port}.
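As a sketch, the actor can assemble the argument vector above for `subprocess.Popen` (the function name is illustrative; `VLLM_SERVER_DEV_MODE=1` would be injected via the `env` argument):

```python
def build_vllm_command(model_path: str, engine_port: int, tp_size: int) -> list:
    """Argument vector mirroring the `vllm serve` invocation shown above."""
    return [
        "vllm", "serve", model_path,
        "--host", "0.0.0.0",
        "--port", str(engine_port),
        "--tensor-parallel-size", str(tp_size),
        "--enable-sleep-mode",
        "--enforce-eager",
        "--gpu-memory-utilization", "0.9",
        "--disable-log-requests",
    ]
```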

                Router
                  │
                  ▼
    ┌─────────────────────────┐
    │ Translation Sidecar     │  ◄── registered with router
    │ port: sidecar_port      │
    │                         │
    │ /generate ──translate──▶│──┐
    │ /health ──passthrough──▶│  │
    │ /abort_request          │  │
    │ /health_generate        │  │
    └─────────────────────────┘  │
                                 │
    ┌─────────────────────────┐  │
    │ vLLM Server             │◄─┘
    │ port: engine_port       │
    │                         │
    │ /v1/completions         │
    │ /health                 │
    │ /sleep, /wake_up        │
    │ /pause, /resume         │
    │ /update_weights         │
    │ /init_weight_transfer   │
    └─────────────────────────┘

5. Abort Strategy — Detailed Design

vLLM's abort mechanism differs fundamentally from SGLang's:

| Aspect | SGLang | vLLM |
|---|---|---|
| Abort granularity | Per-request via `POST /abort_request` with `rid` | Per-connection via HTTP disconnect |
| Bulk abort | `{"abort_all": true}` | No built-in equivalent |
| Mechanism | Engine tracks `request_id`, explicit `abort()` | `@with_cancellation` decorator; request cancelled when client disconnects |
| Between-generation abort | Abort + restart | `POST /pause` → training → `POST /resume` |

Recommended implementation

For the Slime RL use case, the rollout calls abort_all between generation rounds (to clear the engine before the next batch). The best vLLM equivalent is:

# In the translation sidecar (FastAPI). `active_connections` is a module-level
# set of httpx stream handles tracked by the /generate handler. The two
# options below are alternatives, not meant to be combined.
@app.post("/abort_request")
async def abort_request(request: Request):
    body = await request.json()
    if body.get("abort_all"):
        # Option 1: Close all tracked httpx connections → triggers vLLM cancellation
        for conn in list(active_connections):
            await conn.aclose()
        active_connections.clear()

        # Option 2: Use pause/resume (cleaner)
        async with httpx.AsyncClient() as client:
            await client.post(f"{vllm_url}/pause")
            await client.post(f"{vllm_url}/resume")

    return {"status": "ok"}

6. Endpoints Summary — Gap Analysis

Engine-side endpoints (vLLM built-in vs. needs implementation)

| Endpoint | vLLM built-in | Action |
|---|---|---|
| `POST /v1/completions` | ✅ | Reuse — target for translation |
| `GET /health` | ✅ | Reuse as-is (passthrough) |
| `POST /pause` | ✅ (dev mode) | Reuse for abort/weight-update |
| `POST /resume` | ✅ (dev mode) | Reuse for abort/weight-update |
| `POST /sleep` | ✅ (dev mode) | Reuse for weight updates |
| `POST /wake_up` | ✅ (dev mode) | Reuse for weight updates |
| `POST /collective_rpc` | ✅ (dev mode) | Reuse for weight reload |
| `GET /is_sleeping` | ✅ (dev mode) | Reuse for state checks |
| `POST /init_weight_transfer_engine` | ✅ (dev mode) | Reuse for NCCL setup |
| `POST /update_weights` | ✅ (dev mode) | Reuse for NCCL weight apply |
| `GET /get_world_size` | ✅ (dev mode) | Reuse for TP world size |

Translation sidecar endpoints (to implement)

| Endpoint | Description | Complexity |
|---|---|---|
| `POST /generate` | Translate SGLang → `/v1/completions` → SGLang | Medium — main logic |
| `GET /health` | Proxy to vLLM `/health` | Trivial |
| `GET /health_generate` | Health + optional dummy completion | Low |
| `POST /abort_request` | Close connections or pause/resume | Low |
| `GET /flush_cache` | `POST /sleep?level=1` + `POST /wake_up?tags=kv_cache` | Low |
| `GET /get_weight_version` | Return sidecar-tracked version counter | Trivial |

Router endpoints (no change needed)

| Endpoint | Action |
|---|---|
| `POST /add_worker` | No change |
| `GET /list_workers` | No change |
| `POST /retrieve_from_text` | No change |
| Catch-all proxy | No change |

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a detailed Request for Comments (RFC) outlining a design to integrate vLLM as the new inference backend for SlimeRouter, replacing the existing SGLang backend. The core principle of this proposal is to achieve this transition without modifying the current SlimeRouter or its middleware. This is accomplished by introducing a dedicated translation sidecar that interfaces with vLLM's OpenAI-compatible API server, ensuring seamless communication and operation between the systems.

Highlights

  • Backend Transition Proposal: Proposed replacing the SGLang inference backend with vLLM for SlimeRouter, aiming to enhance performance and capabilities.
  • Router Reusability: The design ensures that the existing SlimeRouter and its middleware stack remain completely unchanged, preserving current routing logic and features.
  • Translation Layer Introduction: A new vllm_translation_sidecar.py component is introduced to handle the translation of SGLang-format requests and responses to and from vLLM's OpenAI-compatible API.
  • New Component Definition: Defined new components, including the vllm_translation_sidecar.py and vllm_engine.py (a Ray actor), to manage the vLLM server lifecycle and integration.
  • Detailed API Contract: Outlined a comprehensive API contract for the translation sidecar, detailing request/response translation for the /generate endpoint and handling of lifecycle endpoints like /health and /abort_request.


Changelog
  • docs/en/vllm/ROUTER_DESIGN.md
    • Added a detailed Request for Comments (RFC) document outlining the proposed design for integrating vLLM as the inference backend for SlimeRouter.
Activity
  • This pull request is a new proposal for discussion and has not yet undergone review.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a comprehensive and well-structured design document (RFC) for integrating vLLM as a backend for SlimeRouter. The proposal is detailed and covers architecture, API contracts, and implementation strategies. My review focuses on ensuring the design's robustness and clarity, particularly regarding its assumptions about vLLM's behavior. I've identified a potentially critical issue in the proposed abort strategy and a few areas where more clarification would strengthen the design, such as the cache flushing mechanism and specific server configurations.

Note: Security Review has been skipped due to the limited scope of the PR.

Comment on lines +411 to +413:

```python
# Option 2: Use pause/resume (cleaner)
await httpx.post(f"{vllm_url}/pause")
await httpx.post(f"{vllm_url}/resume")
```
critical

The proposed "Option 2" for implementing abort_all by calling /pause and /resume seems questionable. My understanding of vLLM's /pause endpoint is that it prevents new requests from being scheduled but does not terminate requests already in progress. This would not achieve the "abort all" semantic required by the rollout process.

Could you please verify the behavior of the /pause endpoint? If it doesn't abort in-flight requests, this design should strongly recommend Option 1 (tracking and closing HTTP connections), as that aligns with vLLM's standard cancellation mechanism. This is a critical part of the design to get right for ensuring the engine can be cleared between generation rounds.

Comment on lines +128 to +129:

```json
"input_ids": [128000, 2610, 553, 264, 11190, 18328, 13],
"input_tokens": [128000, 2610, 553, 264, 11190, 18328, 13],
```
medium

The example for an incoming /generate request includes both input_ids and input_tokens with identical values. This could be a source of confusion or bugs if they were to ever diverge.

To improve clarity and robustness, it would be helpful to specify:

  • Are both fields always required?
  • If both are present and differ, which one takes precedence?
  • Could one of them be considered redundant and removed from the contract to simplify the API?

```shell
    --port <engine_port> \
    --tensor-parallel-size <tp_size> \
    --enable-sleep-mode \
    --enforce-eager \
```
medium

The server launch configuration includes the --enforce-eager flag. This flag alters vLLM's memory management and can have performance implications. For completeness, could you add a brief note explaining why this flag is necessary or beneficial for the Slime RL use case? This would help future readers understand the rationale behind this specific configuration choice.

| `GET /health` | Proxy to vLLM `/health` | **Trivial** |
| `GET /health_generate` | Health + optional dummy completion | **Low** |
| `POST /abort_request` | Close connections or pause/resume | **Low** |
| `GET /flush_cache` | `POST /sleep?level=1` + `POST /wake_up?tags=kv_cache` | **Low** |
medium

The proposed implementation for GET /flush_cache using POST /sleep?level=1 followed by POST /wake_up?tags=kv_cache is a bit confusing. The sleep command offloads the KV cache from the GPU, but wake_up with tags=kv_cache seems to imply reloading it. This sequence doesn't intuitively translate to "flushing" the cache.

Could you clarify the exact semantics of this operation? Does wake_up re-initialize an empty cache, effectively flushing it? Explaining this would improve the clarity of the design. If there's a more direct way to clear the KV cache in vLLM, that might be preferable.
