[Do not merge] Propose router design for vLLM #4

knlnguyen1802 wants to merge 1 commit into SamitHuang:dev_vllm
Conversation
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a detailed Request for Comments (RFC) outlining a design to integrate vLLM as the new inference backend for SlimeRouter, replacing the existing SGLang backend. The core principle of the proposal is to make this transition without modifying the current SlimeRouter or its middleware. This is accomplished by introducing a dedicated translation sidecar that interfaces with vLLM's OpenAI-compatible API server, keeping communication between the two systems seamless.
Code Review
This pull request introduces a comprehensive and well-structured design document (RFC) for integrating vLLM as a backend for SlimeRouter. The proposal is detailed and covers architecture, API contracts, and implementation strategies. My review focuses on ensuring the design's robustness and clarity, particularly regarding its assumptions about vLLM's behavior. I've identified a potentially critical issue in the proposed abort strategy and a few areas where more clarification would strengthen the design, such as the cache flushing mechanism and specific server configurations.
Note: Security Review has been skipped due to the limited scope of the PR.
```python
# Option 2: Use pause/resume (cleaner)
async with httpx.AsyncClient() as client:
    await client.post(f"{vllm_url}/pause")
    await client.post(f"{vllm_url}/resume")
```
The proposed "Option 2" for implementing `abort_all` by calling `/pause` and `/resume` seems questionable. My understanding of vLLM's `/pause` endpoint is that it prevents new requests from being scheduled but does not terminate requests already in progress. This would not achieve the "abort all" semantic required by the rollout process.

Could you please verify the behavior of the `/pause` endpoint? If it doesn't abort in-flight requests, this design should strongly recommend Option 1 (tracking and closing HTTP connections), as that aligns with vLLM's standard cancellation mechanism. This is a critical part of the design to get right for ensuring the engine can be cleared between generation rounds.
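To make Option 1 concrete, here is a minimal sketch of connection tracking in the sidecar. This is a pure-asyncio skeleton; `track`, `abort_all`, and the fake call are illustrative names, and the real handler would wrap the `httpx` request to vLLM so that cancelling the task closes the underlying connection:

```python
import asyncio

# Sketch of Option 1: the sidecar keeps a registry of in-flight proxy tasks
# and cancels them all on /abort_request. Cancelling a task closes the HTTP
# connection to vLLM, which triggers its @with_cancellation-based cleanup.
inflight: set[asyncio.Task] = set()

def track(task: asyncio.Task) -> asyncio.Task:
    """Register a proxy task so abort_all() can cancel it later."""
    inflight.add(task)
    task.add_done_callback(inflight.discard)
    return task

def abort_all() -> int:
    """Cancel every in-flight proxy request (the /abort_request handler)."""
    tasks = list(inflight)
    for t in tasks:
        t.cancel()
    return len(tasks)

async def demo():
    async def fake_vllm_call():
        await asyncio.sleep(60)  # stands in for the proxied /v1/completions call

    tasks = [track(asyncio.create_task(fake_vllm_call())) for _ in range(3)]
    await asyncio.sleep(0)  # let the tasks start before aborting
    aborted = abort_all()
    await asyncio.gather(*tasks, return_exceptions=True)
    return aborted, len(inflight)

print(asyncio.run(demo()))  # → (3, 0)
```

The registry-plus-cancel shape is the important part; whether it lives in FastAPI middleware or per-request handlers is an implementation detail.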
```json
"input_ids": [128000, 2610, 553, 264, 11190, 18328, 13],
"input_tokens": [128000, 2610, 553, 264, 11190, 18328, 13],
```
The example for an incoming `/generate` request includes both `input_ids` and `input_tokens` with identical values. This could be a source of confusion or bugs if they ever diverge.
To improve clarity and robustness, it would be helpful to specify:

- Are both fields always required?
- If both are present and differ, which one takes precedence?
- Could one of them be considered redundant and removed from the contract to simplify the API?
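One possible resolution of these questions, sketched as a defensive helper (hypothetical, not part of the current contract): prefer `input_ids`, and fail fast if the two fields disagree.

```python
def resolve_prompt_tokens(body: dict) -> list[int]:
    """Hypothetical sidecar helper: pick the prompt token list from a
    /generate body, preferring input_ids and rejecting divergent inputs."""
    ids = body.get("input_ids")
    tokens = body.get("input_tokens")
    if ids is None and tokens is None:
        raise ValueError("request must contain input_ids or input_tokens")
    if ids is not None and tokens is not None and ids != tokens:
        raise ValueError("input_ids and input_tokens diverge")
    return ids if ids is not None else tokens

print(resolve_prompt_tokens({"input_ids": [1, 2], "input_tokens": [1, 2]}))  # → [1, 2]
```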
```bash
--port <engine_port> \
--tensor-parallel-size <tp_size> \
--enable-sleep-mode \
--enforce-eager \
```
The server launch configuration includes the `--enforce-eager` flag. This flag alters vLLM's memory management and can have performance implications. For completeness, could you add a brief note explaining why this flag is necessary or beneficial for the Slime RL use case? This would help future readers understand the rationale behind this specific configuration choice.
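For context, a complete invocation assembling these flags might look like the following sketch (the model path and concrete values are placeholders, not taken from the RFC):

```bash
vllm serve <model_path> \
  --port <engine_port> \
  --tensor-parallel-size <tp_size> \
  --enable-sleep-mode \
  --enforce-eager
```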
| Sidecar endpoint | Implementation | Effort |
| --- | --- | --- |
| `GET /health` | Proxy to vLLM `/health` | **Trivial** |
| `GET /health_generate` | Health + optional dummy completion | **Low** |
| `POST /abort_request` | Close connections or pause/resume | **Low** |
| `GET /flush_cache` | `POST /sleep?level=1` + `POST /wake_up?tags=kv_cache` | **Low** |
The proposed implementation for `GET /flush_cache` using `POST /sleep?level=1` followed by `POST /wake_up?tags=kv_cache` is a bit confusing. The `sleep` command offloads the KV cache from the GPU, but `wake_up` with `tags=kv_cache` seems to imply reloading it. This sequence doesn't intuitively translate to "flushing" the cache.

Could you clarify the exact semantics of this operation? Does `wake_up` re-initialize an empty cache, effectively flushing it? Explaining this would improve the clarity of the design. If there's a more direct way to clear the KV cache in vLLM, that might be preferable.
# RFC: Replace SGLang Backend with vLLM — Router Integration

## Summary
Replace the SGLang inference backend behind SlimeRouter with vLLM while keeping the existing router and middleware stack completely unchanged.
This RFC covers only the router layer — what APIs the vLLM backend must expose, how the existing SlimeRouter is reused, and what translation is needed between the two formats.
Key design decision: Reuse vLLM's built-in OpenAI-compatible API server (`vllm serve`).

## 1. Target Architecture
### What stays the same

- `SlimeRouter` (`router.py`)
- `RadixTreeMiddleware` (`radix_tree_middleware.py`)
- `StringRadixTrie` (`radix_tree.py`)
- Middleware loading (`--slime-router-middleware-paths`, `load_function()`)

### What is new

- `vllm_translation_sidecar.py` — receives `/generate` requests, translates them to vLLM's `/v1/completions`, and translates responses back. Also proxies lifecycle endpoints (`/abort_request`, `/health_generate`, etc.).
- `vllm_engine.py` — manages the vLLM server (`vllm serve`), the translation sidecar, weight updates, and registration with the router.

## 2. Reusing SlimeRouter — Zero Modification
The SlimeRouter communicates with backends through five interaction points. All are already engine-agnostic:
### 2.1 Worker Registration

**Flow:** Engine starts → engine calls `POST /add_worker?url=http://{host}:{port}` → router adds it to the pool.

**vLLM action:** The `VLLMEngine` Ray actor calls this endpoint after verifying that the vLLM server and translation sidecar are healthy. The registered URL points to the sidecar, not the raw vLLM server. No router change needed.

### 2.2 Request Proxying

**Flow:** `POST /generate` → middleware pipeline → `SlimeRouter.proxy()` → `httpx` forwards to backend (sidecar).

The router selects a backend via least-connections (`_use_url()`), forwards the raw request body as-is, and returns the response as-is. It never inspects or transforms the request/response payload.

**vLLM action:** The sidecar receives the forwarded request, translates it to `/v1/completions`, calls the co-located vLLM server, translates the response back to SGLang format, and returns it.

### 2.3 Health Check

**Flow:** A background loop calls `GET {worker_url}/health` every N seconds.

**vLLM action:** The sidecar's `/health` proxies to vLLM's built-in `/health` endpoint (returns 200 when ready). Compatible out of the box.

### 2.4 Worker Listing

**Flow:** `GET /list_workers` → returns `{"urls": [...]}`.

Used by the rollout to discover engines for direct abort calls. No engine involvement.

### 2.5 Retrieve from Text (Radix Tree)

**Flow:** `POST /retrieve_from_text` → router looks up the radix tree cache → returns tokens/logprobs.

Fully router-internal. Never reaches the engine.
## 3. API Contract — What the Translation Sidecar Must Expose
The translation sidecar sits between SlimeRouter and the vLLM server. It receives SGLang-format requests and returns SGLang-format responses.
### 3.1 `POST /generate` — Generation

This is the primary endpoint. The sidecar translates between Slime's format and vLLM's `/v1/completions`.

**Incoming Request (from router)**

```json
{
  "input_ids": [128000, 2610, 553, 264, 11190, 18328, 13],
  "input_tokens": [128000, 2610, 553, 264, 11190, 18328, 13],
  "sampling_params": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": -1,
    "max_new_tokens": 1024,
    "stop": ["<|endoftext|>"],
    "stop_token_ids": [128001],
    "skip_special_tokens": false,
    "no_stop_trim": true,
    "spaces_between_special_tokens": false
  },
  "return_logprob": true,
  "stream": false
}
```

**Translated Request (to vLLM `/v1/completions`)**

```json
{
  "model": "<model_name>",
  "prompt": [128000, 2610, 553, 264, 11190, 18328, 13],
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": -1,
  "stop": ["<|endoftext|>"],
  "stop_token_ids": [128001],
  "skip_special_tokens": false,
  "include_stop_str_in_output": true,
  "spaces_between_special_tokens": false,
  "logprobs": 1,
  "stream": false,
  "extra_body": { "return_token_ids": true }
}
```

**Key translations:**

- `input_ids` → `prompt` (vLLM accepts `list[int]` as a pre-tokenized prompt)
- `max_new_tokens` → `max_tokens`
- `no_stop_trim: true` → `include_stop_str_in_output: true`
- `return_logprob: true` → `logprobs: 1` + `extra_body.return_token_ids: true`

**vLLM Response (from `/v1/completions`)**

```json
{
  "id": "cmpl-abc123",
  "choices": [{
    "text": "I'll help you with that. The answer is 42.",
    "logprobs": {
      "token_logprobs": [-0.152, -0.089, -0.203],
      "tokens": ["I", "'ll", " help"]
    },
    "token_ids": [40, 3358, 1520],
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 7, "completion_tokens": 3, "total_tokens": 10 }
}
```

**Translated Response (returned to router)**

```json
{
  "text": "I'll help you with that. The answer is 42.",
  "output_ids": [40, 3358, 1520],
  "meta_info": {
    "output_token_logprobs": [
      [-0.152, 40],
      [-0.089, 3358],
      [-0.203, 1520]
    ],
    "finish_reason": { "type": "stop" },
    "weight_version": 3,
    "prompt_tokens": 7,
    "cached_tokens": 0
  }
}
```

**Field-by-field contract**

| Field | Type | Notes |
| --- | --- | --- |
| `text` | `str` | |
| `output_ids` | `list[int]` | |
| `meta_info.output_token_logprobs` | `list[[float, int]]` | Present when `return_logprob` is set; each entry is `[logprob, token_id]`. Used for RL policy ratio calculation. |
| `meta_info.finish_reason` | `{"type": str}` | `{"type": "stop"}`, `{"type": "length"}`, or `{"type": "abort"}`. Not a plain string. |
| `meta_info.weight_version` | `int` | |
| `meta_info.prompt_tokens` | `int` | From `usage.prompt_tokens`. |
| `meta_info.cached_tokens` | `int` | Always `0`. |
### 3.2 `GET /health` — Health Check

vLLM already provides this endpoint. Passthrough — no translation needed.
### 3.3 `POST /abort_request` — Cancel Generation

Called directly by the rollout to each engine (bypasses the router). The rollout discovers engine URLs via `GET /list_workers`, then sends an abort to each.

**vLLM approach:** vLLM uses HTTP connection close for abort (via its `@with_cancellation` decorator). When a client disconnects, the in-flight request is automatically cancelled.

**Implementation options:**

1. **Close connections:** the sidecar tracks open `httpx` connections to the vLLM server. On `POST /abort_request`, close all of them — triggering vLLM's cancellation.
2. **Pause/resume:** use vLLM's `/pause` endpoint. Call `POST /pause` to block new requests, then `POST /resume` after the RL training step completes. This is semantically closer to how Slime uses abort (clearing the decks between training generations).

### 3.4 `GET /health_generate` — Startup Readiness Probe

Called by `VLLMEngine.init()` during startup to block until the engine is fully ready. The sidecar implements this by calling vLLM's `GET /health` and optionally performing a dummy `/v1/completions` call with `max_tokens=1` to verify end-to-end readiness.

### 3.5 Sampling Params Translation
The request uses SGLang-format parameter names. The sidecar translates to vLLM's `/v1/completions` format:

| SGLang field | vLLM `/v1/completions` field | Notes |
| --- | --- | --- |
| `input_ids` | `prompt` | `list[int]` accepted as pre-tokenized prompt |
| `temperature` | `temperature` | |
| `top_p` | `top_p` | |
| `top_k` | `top_k` | `-1` for disabled |
| `max_new_tokens` | `max_tokens` | |
| `stop` | `stop` | |
| `stop_token_ids` | `stop_token_ids` | |
| `skip_special_tokens` | `skip_special_tokens` | |
| `no_stop_trim` | `include_stop_str_in_output` | |
| `spaces_between_special_tokens` | `spaces_between_special_tokens` | |
| `return_logprob` | `logprobs` (set to `1`) + `extra_body.return_token_ids = true` | |
| `sampling_seed` | `seed` | |
| — | `model` | Added by the sidecar |

### 3.6 Response Translation Pseudocode
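A sketch of the response translation, assuming the `/v1/completions` response shape shown in 3.1 (with `token_ids` and `logprobs` requested); function names and the `weight_version` plumbing are illustrative, not an existing implementation:

```python
def translate_finish_reason(reason):
    """3.7 translation table: vLLM finish_reason string -> Slime object."""
    if reason == "stop":
        return {"type": "stop"}
    if reason == "length":
        return {"type": "length"}
    return {"type": "abort"}  # None / aborted / incomplete

def translate_response(vllm_resp: dict, weight_version: int) -> dict:
    """Map a vLLM /v1/completions response to the SGLang-style shape."""
    choice = vllm_resp["choices"][0]
    logprobs = (choice.get("logprobs") or {}).get("token_logprobs", [])
    token_ids = choice.get("token_ids", [])
    return {
        "text": choice["text"],
        "output_ids": token_ids,
        "meta_info": {
            # Slime expects [logprob, token_id] pairs for RL policy ratios.
            "output_token_logprobs": [
                [lp, tid] for lp, tid in zip(logprobs, token_ids)
            ],
            "finish_reason": translate_finish_reason(choice.get("finish_reason")),
            "weight_version": weight_version,
            "prompt_tokens": vllm_resp["usage"]["prompt_tokens"],
            "cached_tokens": 0,
        },
    }
```

Applied to the sample response above, this yields the sample translated response (with `weight_version=3`).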
### 3.7 `finish_reason` Translation Table

| vLLM `finish_reason` | Slime `meta_info.finish_reason` |
| --- | --- |
| `"stop"` | `{"type": "stop"}` |
| `"length"` (hit `max_tokens`) | `{"type": "length"}` |
| `None` (aborted/incomplete) | `{"type": "abort"}` |

## 4. Server Launch Configuration
The `VLLMEngine` Ray actor launches vLLM via `vllm serve`. The translation sidecar runs on a separate port (`<sidecar_port>`) and is the URL registered with the router via `POST /add_worker?url=http://{host}:{sidecar_port}`.

## 5. Abort Strategy — Detailed Design
vLLM's abort mechanism differs fundamentally from SGLang's:

| | SGLang | vLLM |
| --- | --- | --- |
| Abort API | `POST /abort_request` with `rid` or `{"abort_all": true}` | — |
| Mechanism | `request_id`-based, explicit `abort()` | `@with_cancellation` decorator; request cancelled when client disconnects |
| Pause | — | `POST /pause` → training → `POST /resume` |

### Recommended implementation
For the Slime RL use case, the rollout calls `abort_all` between generation rounds (to clear the engine before the next batch). The best vLLM equivalent is connection-level cancellation: the sidecar tracks in-flight requests and closes their connections on `POST /abort_request` (Option 1 in Section 3.3).

## 6. Endpoints Summary — Gap Analysis
### Engine-side endpoints (vLLM built-in vs. needs implementation)

- `POST /v1/completions`
- `GET /health`
- `POST /pause`
- `POST /resume`
- `POST /sleep`
- `POST /wake_up`
- `POST /collective_rpc`
- `GET /is_sleeping`
- `POST /init_weight_transfer_engine`
- `POST /update_weights`
- `GET /get_world_size`

### Translation sidecar endpoints (to implement)

| Endpoint | Implementation |
| --- | --- |
| `POST /generate` | `/v1/completions` → SGLang translation |
| `GET /health` | Proxy to `/health` |
| `GET /health_generate` | Health + optional dummy completion |
| `POST /abort_request` | Close connections or pause/resume |
| `GET /flush_cache` | `POST /sleep?level=1` + `POST /wake_up?tags=kv_cache` |
| `GET /get_weight_version` | |

### Router endpoints (no change needed)

- `POST /add_worker`
- `GET /list_workers`
- `POST /retrieve_from_text`