[Do not merge] Design for complete separate vllm and sglang backend #6

Open

knlnguyen1802 wants to merge 1 commit into SamitHuang:dev_vllm from knlnguyen1802:refactor

Conversation

@knlnguyen1802
Collaborator

RFC: Rollout Separation Plan (EngineGroup Generalization + Executor Cleanup)

1. Summary

This RFC proposes backend separation with minimal churn in runtime orchestration.

Scope is four items:

  1. Refactor slime/ray/rollout.py by generalizing EngineGroup.start_engines() and abstracting engine/server creation (no new runtime manager class hierarchy).
  2. Refactor slime/rollout/sglang_rollout.py in-place: extract the one SGLang-specific code path (RadixTree) into a strategy hook, rename SGLang-prefixed args to generic names. No class hierarchy, no new files.
  3. Refactor slime/utils/arguments.py into shared args + backend arg groups/finalizers.
  4. Decouple slime/backends/fsdp_utils/update_weight_utils.py from SGLang internals so FSDP weight sync works with both SGLang and vLLM engines.

2. Already Done (Reuse, Do Not Rewrite)

3. Problem

  • slime/ray/rollout.py mixes shared and backend-specific engine creation paths.
  • slime/rollout/sglang_rollout.py is ~95% backend-agnostic but has one inlined SGLang-specific code path (RadixTree, ~14 lines) and uses SGLang-prefixed arg names for generic rollout concepts.
  • slime/utils/arguments.py still has SGLang alias behavior in vLLM path.
  • slime/backends/fsdp_utils/update_weight_utils.py hard-imports SGLang internals (FlattenedTensorBucket, MultiprocessingSerializer, monkey_patch_torch_reductions) and calls SGLang-specific engine RPC names (update_weights_from_tensor, update_weights_from_distributed), making FSDP weight sync unusable with vLLM engines.

4. Goals and Non-Goals

Goals

  • Keep runtime refactor minimal and localized to EngineGroup + creation abstraction.
  • Isolate the one SGLang-specific executor code path behind a strategy hook; keep functions as functions.
  • Rename SGLang-prefixed arg names to generic rollout names to eliminate naming coupling.
  • Reduce backend leakage in argument finalization.
  • Preserve current external behavior, call sites, and import paths.

Non-Goals

  • No rewrite of SGLangEngine or VLLMEngine internals.
  • No algorithmic changes to GRPO/PPO.
  • No mandatory feature parity for unsupported backend capabilities.

5. Design

5.1 Runtime: Generalize EngineGroup (No New Runtime Manager Classes)

Keep slime/ray/rollout.py as the orchestration entry file.

Refactor focus:

  1. Generalize EngineGroup.start_engines() to call backend-aware creation hooks.
  2. Abstract engine creation and rollout-server assembly helpers.
  3. Keep existing startup function API (start_rollout_servers) and return shape.

Proposed helper abstraction points:

  • create_engine_actor_cls(args, worker_type)
    • returns ray.remote(SGLangEngine) or ray.remote(VLLMEngine).
  • create_engine_remote(args, actor_cls, scheduling_strategy, ...)
    • encapsulates .options(...).remote(...) with backend-specific init kwargs.
  • build_rollout_server(...)
    • standardizes RolloutServer construction from engine groups.

EngineGroup and RolloutServer remain the main shared dataclasses.
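The backend-aware creation hook could be sketched as a small dispatch table. This is a minimal sketch, assuming a `rollout_backend` argument name that the RFC does not spell out; `SGLangEngine` and `VLLMEngine` are stand-ins for the real engine classes, and the `ray.remote(...)` wrapping described above is elided.

```python
# Sketch of create_engine_actor_cls as a dispatch table. The engine
# classes below are stand-ins; in the real code the selected class
# would be wrapped with ray.remote(...). `rollout_backend` is an
# assumed argument name, not confirmed by this RFC.

class SGLangEngine:  # stand-in for slime's SGLang engine actor
    pass

class VLLMEngine:  # stand-in for slime's vLLM engine actor
    pass

_ENGINE_CLASSES = {"sglang": SGLangEngine, "vllm": VLLMEngine}

def create_engine_actor_cls(args, worker_type):
    """Return the engine class for the configured backend."""
    try:
        return _ENGINE_CLASSES[args.rollout_backend]
    except KeyError:
        raise ValueError(f"unknown rollout backend: {args.rollout_backend!r}")
```

A table lookup keeps `start_engines()` free of backend conditionals: adding a third backend means one new dictionary entry rather than another `if` branch.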

5.2 Executor: Isolate Backend Logic In-Place (No Class Hierarchy)

Current state analysis

slime/rollout/sglang_rollout.py (577 lines) is already ~95% backend-agnostic:

| Function / Class | Lines | Backend-specific? | Notes |
| --- | --- | --- | --- |
| _get_backend_client() | 6 | Factory only | Delegates to existing RolloutBackendClient subclasses |
| _apply_backend_response() | 20 | No | Uses RolloutBackendResponse contract |
| GenerateState | 37 | Naming only | References sglang_server_concurrency, sglang_dp_size, sglang_enable_deterministic_inference — all are generic rollout concepts with SGLang-prefixed names |
| generate() | 58 | 14 lines | RadixTree middleware path (L170-183) is 100% SGLang-specific; the else branch (L185-195) already uses RolloutBackendClient |
| generate_and_rm() | 60 | No | Shared orchestration (semaphore, custom func, reward) |
| generate_and_rm_group() | 37 | No | Group parallelism + deterministic seeds |
| abort() | 38 | No | Already uses backend.abort() |
| generate_rollout_async() | 71 | No | Main loop, filtering, metrics |
| eval_rollout() / eval_rollout_single_dataset() | 118 | Naming only | sglang_enable_deterministic_inference reference |
| generate_rollout() | 41 | No | Sync entry point |

Conclusion: the actual backend logic that needs isolation is one code path (~14 lines) inside generate(). Everything else is either already abstracted through RolloutBackendClient or is a naming-only coupling (SGLang-prefixed arg names for generic concepts).

Approach

  1. Extract the RadixTree path into a strategy hook that generate() calls conditionally.
  2. Rename SGLang-prefixed args to generic names (coordinated with Phase 3 args refactor).
  3. Keep functions as functions — they compose well and callers (train.py, OPD, multi-agent) import them directly.

Concrete changes

Step 1 — Extract RadixTree strategy from generate()

Current generate() has an if use_radix: ... else: backend.generate(...) branch.
Refactor into:

# slime/rollout/sglang_rollout.py — generate() simplified

async def generate(args, sample, sampling_params):
    ...
    input_ids = ...  # shared prompt encoding (unchanged)

    strategy = _get_generate_strategy(args)
    resp = await strategy(args, sample, input_ids, sampling_params)
    _apply_backend_response(sample, resp, args)
    return sample

Two strategies:

# Still in sglang_rollout.py (no new file needed)

def _get_generate_strategy(args):
    """Return the generate coroutine to use."""
    if _is_radix_tree_enabled(args):
        return _generate_radix_tree      # SGLang-only path
    return _generate_via_backend_client   # Generic path (SGLang or vLLM)

def _is_radix_tree_enabled(args) -> bool:
    return (
        args.use_slime_router
        and "RadixTreeMiddleware" in getattr(args, "slime_router_middleware_paths", [])
    )

async def _generate_radix_tree(args, sample, input_ids, sampling_params) -> RolloutBackendResponse:
    """SGLang RadixTree middleware path — returns normalized response."""
    from slime.router.middleware_hub.radix_tree_middleware import postprocess_sample_with_radix_tree
    url = f"http://{args.rollout_router_ip}:{args.rollout_router_port}/generate"
    payload = { ... }  # existing payload construction
    output = await post(url, payload, headers=headers)
    sample = await postprocess_sample_with_radix_tree(args, sample, output)
    return _extract_response_from_sample(sample)  # normalize to RolloutBackendResponse

async def _generate_via_backend_client(args, sample, input_ids, sampling_params) -> RolloutBackendResponse:
    """Generic backend client path — works for SGLang and vLLM."""
    backend = _get_backend_client(args)
    base_url = f"http://{args.rollout_router_ip}:{args.rollout_router_port}"
    req = RolloutBackendRequest(...)
    return await backend.generate(req, base_url, headers=headers)

Step 2 — Rename SGLang-prefixed args to generic names

| Current name | New name | Reason |
| --- | --- | --- |
| sglang_server_concurrency | rollout_concurrency | Controls request parallelism for any backend |
| sglang_dp_size | rollout_dp_size | Data-parallel sharding, not SGLang-specific |
| sglang_router_ip / sglang_router_port | rollout_router_ip / rollout_router_port | Router endpoint, backend-agnostic |
| sglang_router_policy | rollout_router_policy | Routing strategy |
| sglang_enable_deterministic_inference | rollout_deterministic_inference | Seed-based determinism |
| vllm_base_url | (remove) | Folded into rollout_router_ip:port, no special case |

Legacy aliases kept in slime/utils/arguments.py for one release cycle (coordinated with Phase 3).
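The alias fallback could look like the following hypothetical sketch. The old-to-new mapping follows the rename table above, but `apply_legacy_aliases` and its exact placement in `arguments.py` are assumptions, not part of this RFC.

```python
import warnings

# Hypothetical one-release-cycle alias shim; the mapping follows the
# rename table in this section, the function itself is illustrative.
_LEGACY_ALIASES = {
    "sglang_server_concurrency": "rollout_concurrency",
    "sglang_dp_size": "rollout_dp_size",
    "sglang_router_ip": "rollout_router_ip",
    "sglang_router_port": "rollout_router_port",
    "sglang_router_policy": "rollout_router_policy",
    "sglang_enable_deterministic_inference": "rollout_deterministic_inference",
}

def apply_legacy_aliases(args):
    """Copy deprecated SGLang-prefixed values onto the generic names."""
    for old, new in _LEGACY_ALIASES.items():
        old_val = getattr(args, old, None)
        if old_val is not None and getattr(args, new, None) is None:
            warnings.warn(f"--{old} is deprecated, use --{new}",
                          DeprecationWarning, stacklevel=2)
            setattr(args, new, old_val)
    return args
```

Warning on every deprecated use gives downstream scripts one release cycle of signal before the old names disappear.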

Step 3 — No new files

The file stays as slime/rollout/sglang_rollout.py during this phase.
Optionally rename to slime/rollout/rollout.py in Phase 4 cleanup, since the file is backend-agnostic after the refactor.

5.3 Arguments/Config Refactor

In slime/utils/arguments.py, split into:

  1. Shared rollout args.
  2. SGLang backend args/validation.
  3. vLLM backend args/validation.

Add backend finalizers:

  • finalize_sglang_args(args)
  • finalize_vllm_args(args)

Move SGLang alias fallback out of shared finalize flow.
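The split could be wired up as below. The finalizer names follow this section; the shared default value and the `rollout_backend` selector are illustrative assumptions, not values from the codebase.

```python
# Sketch of the shared-args + backend-finalizer split proposed in 5.3.
# finalize_sglang_args / finalize_vllm_args names come from this RFC;
# everything else here is an illustrative assumption.

def finalize_shared_args(args):
    """Backend-agnostic rollout defaults live in the shared flow."""
    if getattr(args, "rollout_concurrency", None) is None:
        args.rollout_concurrency = 512  # illustrative default only

def finalize_sglang_args(args):
    """SGLang-specific validation and alias handling go here."""
    pass  # placeholder body

def finalize_vllm_args(args):
    """vLLM-specific validation; no SGLang alias fallback here."""
    pass  # placeholder body

_BACKEND_FINALIZERS = {"sglang": finalize_sglang_args, "vllm": finalize_vllm_args}

def finalize_args(args):
    finalize_shared_args(args)
    _BACKEND_FINALIZERS[args.rollout_backend](args)
    return args
```

With this shape, the vLLM path never executes SGLang alias logic, which is exactly the leakage Section 3 calls out.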

5.4 Weight Sync: Decouple FSDP update_weight_utils.py from SGLang Internals

Current state analysis

slime/backends/fsdp_utils/update_weight_utils.py (287 lines) has two concrete classes:

| Class | Weight-push method | SGLang coupling |
| --- | --- | --- |
| UpdateWeightFromTensor | IPC via Gloo gather → engine.update_weights_from_tensor.remote() | Imports FlattenedTensorBucket, MultiprocessingSerializer, monkey_patch_torch_reductions directly from sglang.srt.* |
| UpdateWeightFromDistributed | NCCL broadcast → engine.update_weights_from_distributed.remote() | Calls engine.init_weights_update_group.remote() — SGLang engine API |

The abstract base UpdateWeight itself is clean (only PyTorch + Ray).

The Megatron side already solved this: slime/backends/megatron_utils/sglang.py centralizes all SGLang imports into one shim. The FSDP side duplicates these imports inline.

Coupling points

  1. SGLang utility imports (L13-26) — monkey_patch_torch_reductions, MultiprocessingSerializer, and FlattenedTensorBucket are imported directly from sglang.srt.* with try/except version fallbacks.
  2. UpdateWeightFromTensor.update_bucket_weights() — uses FlattenedTensorBucket to flatten tensors, MultiprocessingSerializer to serialize, then calls engine.update_weights_from_tensor.remote().
  3. UpdateWeightFromDistributed.connect_rollout_engines() — calls engine.init_weights_update_group.remote() which is an SGLang engine method.
  4. UpdateWeightFromDistributed.update_bucket_weights() — calls engine.update_weights_from_distributed.remote() which is an SGLang engine method.

All four points assume the engine actor exposes SGLang's RPC interface. vLLM engines expose different method names.

Approach

Step 1 — Centralize SGLang imports via a shim (same pattern as Megatron)

Reuse or mirror the existing slime/backends/megatron_utils/sglang.py pattern:

# slime/backends/fsdp_utils/sglang_compat.py (new, ~15 lines)
try:
    from sglang.srt.utils.patch_torch import monkey_patch_torch_reductions
except ImportError:
    from sglang.srt.patch_torch import monkey_patch_torch_reductions

from sglang.srt.utils import MultiprocessingSerializer

try:
    from sglang.srt.weight_sync.tensor_bucket import FlattenedTensorBucket
except ImportError:
    from sglang.srt.model_executor.model_runner import FlattenedTensorBucket

Then update_weight_utils.py imports from sglang_compat — one import line instead of five, and only loaded when SGLang is the active backend.

Step 2 — Abstract engine RPC calls behind a protocol

The engine actors (SGLangEngine, VLLMEngine) already expose weight-sync methods but with different names/signatures. Introduce a lightweight protocol or adapter:

# In update_weight_utils.py or a small helper

def _call_engine_update_tensor(engine, backend: str, **kwargs):
    """Dispatch IPC weight update to the correct engine method."""
    if backend == "vllm":
        return engine.update_weights_from_tensor.remote(**kwargs)  # vLLM uses same name via NcclBridge
    return engine.update_weights_from_tensor.remote(**kwargs)  # SGLang native

def _call_engine_update_distributed(engine, backend: str, **kwargs):
    if backend == "vllm":
        return engine.update_weights_from_distributed.remote(**kwargs)
    return engine.update_weights_from_distributed.remote(**kwargs)

def _call_engine_init_weight_group(engine, backend: str, **kwargs):
    if backend == "vllm":
        return engine.init_weights_update_group.remote(**kwargs)
    return engine.init_weights_update_group.remote(**kwargs)

Note: Currently VLLMEngine already mirrors these method names (it wraps them via NcclBridge), so the dispatch functions may initially be identical. The value is making the indirection explicit so future method-name divergence is handled in one place.

Step 3 — Lazy-import SGLang utilities only when backend is SGLang

Move the FlattenedTensorBucket / MultiprocessingSerializer imports inside UpdateWeightFromTensor.update_bucket_weights() behind a lazy import, so the module can be loaded in a vLLM-only environment without SGLang installed.
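A minimal sketch of the lazy-import pattern, assuming a guard helper that this RFC does not name; the availability check is plain Python, and the deferred `sglang_compat` import is shown only as a comment since the shim is hypothetical until Step 1 lands.

```python
import importlib.util

def sglang_available() -> bool:
    """True if sglang is importable; keeps module import SGLang-free."""
    return importlib.util.find_spec("sglang") is not None

def update_bucket_weights_sketch(named_tensors):
    """Illustrative SGLang IPC path with imports deferred to call time."""
    if not sglang_available():
        raise RuntimeError("SGLang backend selected but sglang is not installed")
    # Real code would import here, inside the SGLang-only path:
    # from .sglang_compat import FlattenedTensorBucket, MultiprocessingSerializer
    ...
```

Because nothing SGLang-related runs at module scope, a vLLM-only deployment can import `update_weight_utils.py` and only hits the guard if the SGLang path is actually invoked.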

What changes

| File | Change |
| --- | --- |
| slime/backends/fsdp_utils/sglang_compat.py | New shim file (~15 lines) centralizing SGLang imports |
| slime/backends/fsdp_utils/update_weight_utils.py | Replace 5 inline SGLang imports with one from .sglang_compat import ...; add backend-aware engine dispatch helpers |

What does NOT change

  • UpdateWeight abstract base class — already clean.
  • UpdateWeightFromDistributed NCCL logic — the broadcast itself is pure PyTorch; only the engine RPC dispatch gets a thin wrapper.
  • Megatron-side weight sync — already has its own shim, not touched.

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

@gemini-code-assist bot left a comment


Code Review

This pull request outlines an RFC for backend separation, involving the generalization of engine groups and the decoupling of weight synchronization from SGLang internals. A review comment suggests simplifying the proposed dispatch functions for weight updates to eliminate redundant conditional logic where backend method names are currently identical.

Comment on lines +236 to +250
def _call_engine_update_tensor(engine, backend: str, **kwargs):
    """Dispatch IPC weight update to the correct engine method."""
    if backend == "vllm":
        return engine.update_weights_from_tensor.remote(**kwargs)  # vLLM uses same name via NcclBridge
    return engine.update_weights_from_tensor.remote(**kwargs)  # SGLang native

def _call_engine_update_distributed(engine, backend: str, **kwargs):
    if backend == "vllm":
        return engine.update_weights_from_distributed.remote(**kwargs)
    return engine.update_weights_from_distributed.remote(**kwargs)

def _call_engine_init_weight_group(engine, backend: str, **kwargs):
    if backend == "vllm":
        return engine.init_weights_update_group.remote(**kwargs)
    return engine.init_weights_update_group.remote(**kwargs)
Severity: medium

The proposed dispatch functions in this section contain redundant if/else blocks where both branches execute identical code. While the intention to create an abstraction layer for future divergence is clear, the current implementation is confusing.

If the method names on SGLangEngine and VLLMEngine are currently identical (as the note suggests), you can simplify these functions to remove the conditional logic. This makes the code cleaner and still provides a single point of modification if the method names diverge later.

For example:

# In update_weight_utils.py or a small helper

def _call_engine_update_tensor(engine, **kwargs):
    """Dispatch IPC weight update to the correct engine method."""
    # This wrapper provides a single point of dispatch.
    # Currently, both vLLM and SGLang engines expose the same method name.
    return engine.update_weights_from_tensor.remote(**kwargs)

def _call_engine_update_distributed(engine, **kwargs):
    """Dispatch distributed weight update to the correct engine method."""
    return engine.update_weights_from_distributed.remote(**kwargs)

def _call_engine_init_weight_group(engine, **kwargs):
    """Dispatch weight group initialization to the correct engine method."""
    return engine.init_weights_update_group.remote(**kwargs)

This simplification makes it clearer that the abstraction relies on the engine objects conforming to a common interface. If you anticipate different keyword arguments or method names for different backends in the near future, it would be better to keep the if/else structure but perhaps add a comment explaining why it's currently redundant.
