[Do not merge] Propose router design for vLLM #4

knlnguyen1802 wants to merge 1 commit into SamitHuang:dev_vllm
Conversation
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a detailed Request for Comments (RFC) outlining a design to integrate vLLM as the new inference backend for SlimeRouter, replacing the existing SGLang backend. The core principle of the proposal is to make this transition without modifying the current SlimeRouter or its middleware. This is accomplished by introducing a dedicated translation sidecar that interfaces with vLLM's OpenAI-compatible API server, keeping communication between the two systems seamless.
Code Review
This pull request introduces a comprehensive and well-structured design document (RFC) for integrating vLLM as a backend for SlimeRouter. The proposal is detailed and covers architecture, API contracts, and implementation strategies. My review focuses on ensuring the design's robustness and clarity, particularly regarding its assumptions about vLLM's behavior. I've identified a potentially critical issue in the proposed abort strategy and a few areas where more clarification would strengthen the design, such as the cache flushing mechanism and specific server configurations.
Note: Security Review has been skipped due to the limited scope of the PR.
```python
# Option 2: Use pause/resume (cleaner)
async with httpx.AsyncClient() as client:
    await client.post(f"{vllm_url}/pause")
    await client.post(f"{vllm_url}/resume")
```
The proposed "Option 2" for implementing `abort_all` by calling `/pause` and `/resume` seems questionable. My understanding of vLLM's `/pause` endpoint is that it prevents new requests from being scheduled but does not terminate requests already in progress. This would not achieve the "abort all" semantic required by the rollout process.

Could you please verify the behavior of the `/pause` endpoint? If it doesn't abort in-flight requests, this design should strongly recommend Option 1 (tracking and closing HTTP connections), as that aligns with vLLM's standard cancellation mechanism. This is a critical part of the design to get right for ensuring the engine can be cleared between generation rounds.
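To make Option 1 concrete, here is a minimal sketch of connection tracking in the sidecar. This is a pure-asyncio skeleton; `track`, `abort_all`, and the fake call are illustrative names, and the real handler would wrap the `httpx` request to vLLM so that cancelling the task closes the underlying connection:

```python
import asyncio

# Sketch of Option 1: the sidecar keeps a registry of in-flight proxy tasks
# and cancels them all on /abort_request. Cancelling a task closes the HTTP
# connection to vLLM, which triggers its @with_cancellation-based cleanup.
inflight: set[asyncio.Task] = set()

def track(task: asyncio.Task) -> asyncio.Task:
    """Register a proxy task so abort_all() can cancel it later."""
    inflight.add(task)
    task.add_done_callback(inflight.discard)
    return task

def abort_all() -> int:
    """Cancel every in-flight proxy request (the /abort_request handler)."""
    tasks = list(inflight)
    for t in tasks:
        t.cancel()
    return len(tasks)

async def demo():
    async def fake_vllm_call():
        await asyncio.sleep(60)  # stands in for the proxied /v1/completions call

    tasks = [track(asyncio.create_task(fake_vllm_call())) for _ in range(3)]
    await asyncio.sleep(0)  # let the tasks start before aborting
    aborted = abort_all()
    await asyncio.gather(*tasks, return_exceptions=True)
    return aborted, len(inflight)

print(asyncio.run(demo()))  # → (3, 0)
```

The registry-plus-cancel shape is the important part; whether it lives in FastAPI middleware or per-request handlers is an implementation detail.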
```json
"input_ids": [128000, 2610, 553, 264, 11190, 18328, 13],
"input_tokens": [128000, 2610, 553, 264, 11190, 18328, 13],
```
The example for an incoming `/generate` request includes both `input_ids` and `input_tokens` with identical values. This could be a source of confusion or bugs if they ever diverge.
To improve clarity and robustness, it would be helpful to specify:

- Are both fields always required?
- If both are present and differ, which one takes precedence?
- Could one of them be considered redundant and removed from the contract to simplify the API?
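One possible resolution of these questions, sketched as a defensive helper (hypothetical, not part of the current contract): prefer `input_ids`, and fail fast if the two fields disagree.

```python
def resolve_prompt_tokens(body: dict) -> list[int]:
    """Hypothetical sidecar helper: pick the prompt token list from a
    /generate body, preferring input_ids and rejecting divergent inputs."""
    ids = body.get("input_ids")
    tokens = body.get("input_tokens")
    if ids is None and tokens is None:
        raise ValueError("request must contain input_ids or input_tokens")
    if ids is not None and tokens is not None and ids != tokens:
        raise ValueError("input_ids and input_tokens diverge")
    return ids if ids is not None else tokens

print(resolve_prompt_tokens({"input_ids": [1, 2], "input_tokens": [1, 2]}))  # → [1, 2]
```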
```bash
--port <engine_port> \
--tensor-parallel-size <tp_size> \
--enable-sleep-mode \
--enforce-eager \
```
The server launch configuration includes the `--enforce-eager` flag. This flag alters vLLM's memory management and can have performance implications. For completeness, could you add a brief note explaining why this flag is necessary or beneficial for the Slime RL use case? This would help future readers understand the rationale behind this specific configuration choice.
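For context, a complete invocation assembling these flags might look like the following sketch (the model path and concrete values are placeholders, not taken from the RFC):

```bash
vllm serve <model_path> \
  --port <engine_port> \
  --tensor-parallel-size <tp_size> \
  --enable-sleep-mode \
  --enforce-eager
```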
| Sidecar endpoint | Implementation | Effort |
| --- | --- | --- |
| `GET /health` | Proxy to vLLM `/health` | **Trivial** |
| `GET /health_generate` | Health + optional dummy completion | **Low** |
| `POST /abort_request` | Close connections or pause/resume | **Low** |
| `GET /flush_cache` | `POST /sleep?level=1` + `POST /wake_up?tags=kv_cache` | **Low** |
The proposed implementation for `GET /flush_cache` using `POST /sleep?level=1` followed by `POST /wake_up?tags=kv_cache` is a bit confusing. The `sleep` command offloads the KV cache from the GPU, but `wake_up` with `tags=kv_cache` seems to imply reloading it. This sequence doesn't intuitively translate to "flushing" the cache.

Could you clarify the exact semantics of this operation? Does `wake_up` re-initialize an empty cache, effectively flushing it? Explaining this would improve the clarity of the design. If there's a more direct way to clear the KV cache in vLLM, that might be preferable.
# RFC: Replace SGLang Backend with vLLM — Router Integration

## Summary
Replace the SGLang inference backend behind SlimeRouter with vLLM while keeping the existing router and middleware stack completely unchanged.
This RFC covers only the router layer — what APIs the vLLM backend must expose, how the existing SlimeRouter is reused, and what translation is needed between the two formats.
Key design decision: Reuse vLLM's built-in OpenAI-compatible API server (`vllm serve`).

## 1. Target Architecture
### What stays the same

- `SlimeRouter` (`router.py`)
- `RadixTreeMiddleware` (`radix_tree_middleware.py`)
- `StringRadixTrie` (`radix_tree.py`)
- Middleware loading (`--slime-router-middleware-paths`, `load_function()`)

### What is new

- `vllm_translation_sidecar.py` — receives `/generate` requests, translates them to vLLM's `/v1/completions`, and translates responses back. Also proxies lifecycle endpoints (`/abort_request`, `/health_generate`, etc.).
- `vllm_engine.py` — manages the vLLM server (`vllm serve`), the translation sidecar, weight updates, and registration with the router.

## 2. Reusing SlimeRouter — Zero Modification
The SlimeRouter communicates with backends through five interaction points. All are already engine-agnostic:
### 2.1 Worker Registration

**Flow:** Engine starts → engine calls `POST /add_worker?url=http://{host}:{port}` → router adds it to the pool.

**vLLM action:** The `VLLMEngine` Ray actor calls this endpoint after verifying that the vLLM server and translation sidecar are healthy. The registered URL points to the sidecar, not the raw vLLM server. No router change needed.

### 2.2 Request Proxying

**Flow:** `POST /generate` → middleware pipeline → `SlimeRouter.proxy()` → `httpx` forwards to backend (sidecar).

The router selects a backend via least-connections (`_use_url()`), forwards the raw request body as-is, and returns the response as-is. It never inspects or transforms the request/response payload.

**vLLM action:** The sidecar receives the forwarded request, translates it to `/v1/completions`, calls the co-located vLLM server, translates the response back to SGLang format, and returns it.

### 2.3 Health Check

**Flow:** A background loop calls `GET {worker_url}/health` every N seconds.

**vLLM action:** The sidecar's `/health` proxies to vLLM's built-in `/health` endpoint (returns 200 when ready). Compatible out of the box.

### 2.4 Worker Listing

**Flow:** `GET /list_workers` → returns `{"urls": [...]}`.

Used by the rollout to discover engines for direct abort calls. No engine involvement.

### 2.5 Retrieve from Text (Radix Tree)

**Flow:** `POST /retrieve_from_text` → router looks up the radix tree cache → returns tokens/logprobs.

Fully router-internal. Never reaches the engine.
## 3. API Contract — What the Translation Sidecar Must Expose
The translation sidecar sits between SlimeRouter and the vLLM server. It receives SGLang-format requests and returns SGLang-format responses.
### 3.1 `POST /generate` — Generation

This is the primary endpoint. The sidecar translates between Slime's format and vLLM's `/v1/completions`.

**Incoming Request (from router)**

```json
{
  "input_ids": [128000, 2610, 553, 264, 11190, 18328, 13],
  "input_tokens": [128000, 2610, 553, 264, 11190, 18328, 13],
  "sampling_params": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": -1,
    "max_new_tokens": 1024,
    "stop": ["<|endoftext|>"],
    "stop_token_ids": [128001],
    "skip_special_tokens": false,
    "no_stop_trim": true,
    "spaces_between_special_tokens": false
  },
  "return_logprob": true,
  "stream": false
}
```

**Translated Request (to vLLM `/v1/completions`)**

```json
{
  "model": "<model_name>",
  "prompt": [128000, 2610, 553, 264, 11190, 18328, 13],
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": -1,
  "stop": ["<|endoftext|>"],
  "stop_token_ids": [128001],
  "skip_special_tokens": false,
  "include_stop_str_in_output": true,
  "spaces_between_special_tokens": false,
  "logprobs": 1,
  "stream": false,
  "extra_body": { "return_token_ids": true }
}
```

**Key translations:**

- `input_ids` → `prompt` (vLLM accepts `list[int]` as a pre-tokenized prompt)
- `max_new_tokens` → `max_tokens`
- `no_stop_trim: true` → `include_stop_str_in_output: true`
- `return_logprob: true` → `logprobs: 1` + `extra_body.return_token_ids: true`

**vLLM Response (from `/v1/completions`)**

```json
{
  "id": "cmpl-abc123",
  "choices": [{
    "text": "I'll help you with that. The answer is 42.",
    "logprobs": {
      "token_logprobs": [-0.152, -0.089, -0.203],
      "tokens": ["I", "'ll", " help"]
    },
    "token_ids": [40, 3358, 1520],
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 7, "completion_tokens": 3, "total_tokens": 10 }
}
```

**Translated Response (returned to router)**

```json
{
  "text": "I'll help you with that. The answer is 42.",
  "output_ids": [40, 3358, 1520],
  "meta_info": {
    "output_token_logprobs": [
      [-0.152, 40],
      [-0.089, 3358],
      [-0.203, 1520]
    ],
    "finish_reason": { "type": "stop" },
    "weight_version": 3,
    "prompt_tokens": 7,
    "cached_tokens": 0
  }
}
```

**Field-by-field contract**

| Field | Type | Notes |
| --- | --- | --- |
| `text` | `str` | |
| `output_ids` | `list[int]` | |
| `meta_info.output_token_logprobs` | `list[[float, int]]` | Present when `return_logprob` is set; each entry is `[logprob, token_id]`. Used for RL policy ratio calculation. |
| `meta_info.finish_reason` | `{"type": str}` | `{"type": "stop"}`, `{"type": "length"}`, or `{"type": "abort"}`. Not a plain string. |
| `meta_info.weight_version` | `int` | |
| `meta_info.prompt_tokens` | `int` | From `usage.prompt_tokens`. |
| `meta_info.cached_tokens` | `int` | Always `0`. |
### 3.2 `GET /health` — Health Check

vLLM already provides this endpoint. Passthrough — no translation needed.
### 3.3 `POST /abort_request` — Cancel Generation

Called directly by the rollout to each engine (bypasses the router). The rollout discovers engine URLs via `GET /list_workers`, then sends an abort to each.

**vLLM approach:** vLLM uses HTTP connection close for abort (via its `@with_cancellation` decorator). When a client disconnects, the in-flight request is automatically cancelled.

**Implementation options:**

1. **Close connections:** the sidecar tracks open `httpx` connections to the vLLM server. On `POST /abort_request`, close all of them — triggering vLLM's cancellation.
2. **Pause/resume:** use vLLM's `/pause` endpoint. Call `POST /pause` to block new requests, then `POST /resume` after the RL training step completes. This is semantically closer to how Slime uses abort (clearing the decks between training generations).

### 3.4 `GET /health_generate` — Startup Readiness Probe

Called by `VLLMEngine.init()` during startup to block until the engine is fully ready. The sidecar implements this by calling vLLM's `GET /health` and optionally performing a dummy `/v1/completions` call with `max_tokens=1` to verify end-to-end readiness.

### 3.5 Sampling Params Translation
The request uses SGLang-format parameter names. The sidecar translates to vLLM's `/v1/completions` format:

| SGLang field | vLLM `/v1/completions` field | Notes |
| --- | --- | --- |
| `input_ids` | `prompt` | `list[int]` accepted as pre-tokenized prompt |
| `temperature` | `temperature` | |
| `top_p` | `top_p` | |
| `top_k` | `top_k` | `-1` for disabled |
| `max_new_tokens` | `max_tokens` | |
| `stop` | `stop` | |
| `stop_token_ids` | `stop_token_ids` | |
| `skip_special_tokens` | `skip_special_tokens` | |
| `no_stop_trim` | `include_stop_str_in_output` | |
| `spaces_between_special_tokens` | `spaces_between_special_tokens` | |
| `return_logprob` | `logprobs` (set to `1`) + `extra_body.return_token_ids = true` | |
| `sampling_seed` | `seed` | |
| — | `model` | Added by the sidecar |

### 3.6 Response Translation Pseudocode
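A sketch of the response translation, assuming the `/v1/completions` response shape shown in 3.1 (with `token_ids` and `logprobs` requested); function names and the `weight_version` plumbing are illustrative, not an existing implementation:

```python
def translate_finish_reason(reason):
    """3.7 translation table: vLLM finish_reason string -> Slime object."""
    if reason == "stop":
        return {"type": "stop"}
    if reason == "length":
        return {"type": "length"}
    return {"type": "abort"}  # None / aborted / incomplete

def translate_response(vllm_resp: dict, weight_version: int) -> dict:
    """Map a vLLM /v1/completions response to the SGLang-style shape."""
    choice = vllm_resp["choices"][0]
    logprobs = (choice.get("logprobs") or {}).get("token_logprobs", [])
    token_ids = choice.get("token_ids", [])
    return {
        "text": choice["text"],
        "output_ids": token_ids,
        "meta_info": {
            # Slime expects [logprob, token_id] pairs for RL policy ratios.
            "output_token_logprobs": [
                [lp, tid] for lp, tid in zip(logprobs, token_ids)
            ],
            "finish_reason": translate_finish_reason(choice.get("finish_reason")),
            "weight_version": weight_version,
            "prompt_tokens": vllm_resp["usage"]["prompt_tokens"],
            "cached_tokens": 0,
        },
    }
```

Applied to the sample response above, this yields the sample translated response (with `weight_version=3`).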
### 3.7 `finish_reason` Translation Table

| vLLM `finish_reason` | Slime `meta_info.finish_reason` |
| --- | --- |
| `"stop"` | `{"type": "stop"}` |
| `"length"` (hit `max_tokens`) | `{"type": "length"}` |
| `None` (aborted/incomplete) | `{"type": "abort"}` |

## 4. Server Launch Configuration
The `VLLMEngine` Ray actor launches vLLM via `vllm serve`. The translation sidecar runs on a separate port (`<sidecar_port>`) and is the URL registered with the router via `POST /add_worker?url=http://{host}:{sidecar_port}`.

## 5. Abort Strategy — Detailed Design
vLLM's abort mechanism differs fundamentally from SGLang's:

| | SGLang | vLLM |
| --- | --- | --- |
| Abort API | `POST /abort_request` with `rid` or `{"abort_all": true}` | — |
| Mechanism | `request_id`-based, explicit `abort()` | `@with_cancellation` decorator; request cancelled when client disconnects |
| Pause | — | `POST /pause` → training → `POST /resume` |

### Recommended implementation
For the Slime RL use case, the rollout calls `abort_all` between generation rounds (to clear the engine before the next batch). The best vLLM equivalent is connection-level cancellation: the sidecar tracks in-flight requests and closes their connections on `POST /abort_request` (Option 1 in Section 3.3).

## 6. Endpoints Summary — Gap Analysis
### Engine-side endpoints (vLLM built-in vs. needs implementation)

- `POST /v1/completions`
- `GET /health`
- `POST /pause`
- `POST /resume`
- `POST /sleep`
- `POST /wake_up`
- `POST /collective_rpc`
- `GET /is_sleeping`
- `POST /init_weight_transfer_engine`
- `POST /update_weights`
- `GET /get_world_size`

### Translation sidecar endpoints (to implement)

| Endpoint | Implementation |
| --- | --- |
| `POST /generate` | `/v1/completions` → SGLang translation |
| `GET /health` | Proxy to `/health` |
| `GET /health_generate` | Health + optional dummy completion |
| `POST /abort_request` | Close connections or pause/resume |
| `GET /flush_cache` | `POST /sleep?level=1` + `POST /wake_up?tags=kv_cache` |
| `GET /get_weight_version` | |

### Router endpoints (no change needed)

- `POST /add_worker`
- `GET /list_workers`
- `POST /retrieve_from_text`