[Do not merge] Design for complete separate vllm and sglang backend #6

knlnguyen1802 wants to merge 1 commit into SamitHuang:dev_vllm
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Code Review
This pull request outlines an RFC for backend separation, involving the generalization of engine groups and the decoupling of weight synchronization from SGLang internals. A review comment suggests simplifying the proposed dispatch functions for weight updates to eliminate redundant conditional logic where backend method names are currently identical.
```python
def _call_engine_update_tensor(engine, backend: str, **kwargs):
    """Dispatch IPC weight update to the correct engine method."""
    if backend == "vllm":
        return engine.update_weights_from_tensor.remote(**kwargs)  # vLLM uses same name via NcclBridge
    return engine.update_weights_from_tensor.remote(**kwargs)  # SGLang native


def _call_engine_update_distributed(engine, backend: str, **kwargs):
    if backend == "vllm":
        return engine.update_weights_from_distributed.remote(**kwargs)
    return engine.update_weights_from_distributed.remote(**kwargs)


def _call_engine_init_weight_group(engine, backend: str, **kwargs):
    if backend == "vllm":
        return engine.init_weights_update_group.remote(**kwargs)
    return engine.init_weights_update_group.remote(**kwargs)
```
The proposed dispatch functions in this section contain redundant if/else blocks where both branches execute identical code. While the intention to create an abstraction layer for future divergence is clear, the current implementation is confusing.
If the method names on SGLangEngine and VLLMEngine are currently identical (as the note suggests), you can simplify these functions to remove the conditional logic. This makes the code cleaner and still provides a single point of modification if the method names diverge later.
For example:
```python
# In update_weight_utils.py or a small helper
def _call_engine_update_tensor(engine, **kwargs):
    """Dispatch IPC weight update to the correct engine method."""
    # This wrapper provides a single point of dispatch.
    # Currently, both vLLM and SGLang engines expose the same method name.
    return engine.update_weights_from_tensor.remote(**kwargs)


def _call_engine_update_distributed(engine, **kwargs):
    """Dispatch distributed weight update to the correct engine method."""
    return engine.update_weights_from_distributed.remote(**kwargs)


def _call_engine_init_weight_group(engine, **kwargs):
    """Dispatch weight group initialization to the correct engine method."""
    return engine.init_weights_update_group.remote(**kwargs)
```

This simplification makes it clearer that the abstraction relies on the engine objects conforming to a common interface. If you anticipate different keyword arguments or method names for different backends in the near future, it would be better to keep the if/else structure, but add a comment explaining why it is currently redundant.
RFC: Rollout Separation Plan (EngineGroup Generalization + Executor Cleanup)
1. Summary
This RFC proposes backend separation with minimal churn in runtime orchestration.
Scope is four items (detailed in sections 5.1–5.4):

1. Runtime: generalize `EngineGroup.start_engines()` and abstract engine/server creation (no new runtime manager class hierarchy).
2. Executor: isolate backend logic in `slime/rollout/sglang_rollout.py` in-place.
3. Arguments/config: split shared vs. backend-specific args and add backend finalizers.
4. Weight sync: decouple FSDP `update_weight_utils.py` from SGLang internals.

2. Already Done (Reuse, Do Not Rewrite)
3. Problem
The FSDP weight-sync path (`slime/backends/fsdp_utils/update_weight_utils.py`) imports SGLang internals directly (`FlattenedTensorBucket`, `MultiprocessingSerializer`, `monkey_patch_torch_reductions`) and calls SGLang-specific engine RPC names (`update_weights_from_tensor`, `update_weights_from_distributed`), making FSDP weight sync unusable with vLLM engines.

4. Goals and Non-Goals
Goals
- Generalize `EngineGroup` + creation abstraction.

Non-Goals

- No rewrites of `SGLangEngine` or `VLLMEngine` internals.

5. Design
5.1 Runtime: Generalize `EngineGroup` (No New Runtime Manager Classes)

Keep `slime/ray/rollout.py` as the orchestration entry file.
Refactor focus:
- Change `EngineGroup.start_engines()` to call backend-aware creation hooks.
- Keep the existing entry point (`start_rollout_servers`) and its return shape unchanged.

Proposed helper abstraction points:

- `create_engine_actor_cls(args, worker_type)` — returns `ray.remote(SGLangEngine)` or `ray.remote(VLLMEngine)`.
- `create_engine_remote(args, actor_cls, scheduling_strategy, ...)` — calls `.options(...).remote(...)` with backend-specific init kwargs.
- `build_rollout_server(...)` — handles `RolloutServer` construction from engine groups.

`EngineGroup` and `RolloutServer` remain the main shared dataclasses.

5.2 Executor: Isolate Backend Logic In-Place (No Class Hierarchy)
Current state analysis
`slime/rollout/sglang_rollout.py` (577 lines) is already ~95% backend-agnostic:

- `_get_backend_client()` — already abstracted behind `RolloutBackendClient` subclasses
- `_apply_backend_response()` — already abstracted behind the `RolloutBackendResponse` contract
- `GenerateState` — backend-agnostic
- `sglang_server_concurrency`, `sglang_dp_size`, `sglang_enable_deterministic_inference` — all are generic rollout concepts with SGLang-prefixed names
- `generate()` — goes through `RolloutBackendClient`
- `generate_and_rm()` / `generate_and_rm_group()` — backend-agnostic
- `abort()` — delegates to `backend.abort()`
- `generate_rollout_async()` — backend-agnostic
- `eval_rollout()` / `eval_rollout_single_dataset()` — one `sglang_enable_deterministic_inference` reference
- `generate_rollout()` — backend-agnostic

Conclusion: the actual backend logic that needs isolation is one code path (~14 lines) inside `generate()`. Everything else is either already abstracted through `RolloutBackendClient` or is a naming-only coupling (SGLang-prefixed arg names for generic concepts).

Approach
- Keep the backend split inside `generate()` and dispatch calls conditionally, rather than introducing a class hierarchy.
- Keep the module's public functions in place, since existing callers (`train.py`, OPD, multi-agent) import them directly.

Concrete changes
Step 1 — Extract RadixTree strategy from `generate()`

Currently, `generate()` has an `if use_radix: ... else: backend.generate(...)` branch.

Refactor into two strategies:
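The two strategies are not spelled out in this excerpt; a minimal sketch of what the split could look like, assuming hypothetical class names and an async `backend.generate(...)` call (only the `if use_radix` / `else` branch comes from the RFC):

```python
from abc import ABC, abstractmethod


class GenerateStrategy(ABC):
    """One object per code path currently branched inside generate() (hypothetical names)."""

    @abstractmethod
    async def generate(self, backend, prompt, sampling_params):
        ...


class DirectGenerateStrategy(GenerateStrategy):
    """The plain backend.generate(...) path."""

    async def generate(self, backend, prompt, sampling_params):
        return await backend.generate(prompt, sampling_params)


class RadixTreeGenerateStrategy(GenerateStrategy):
    """The use_radix path; the radix-cache bookkeeping is elided in this sketch."""

    async def generate(self, backend, prompt, sampling_params):
        # ... radix-cache lookup / insertion would wrap the call here ...
        return await backend.generate(prompt, sampling_params)


def pick_strategy(use_radix: bool) -> GenerateStrategy:
    """Single dispatch point replacing the inline branch in generate()."""
    return RadixTreeGenerateStrategy() if use_radix else DirectGenerateStrategy()
```

`generate()` then calls `pick_strategy(use_radix).generate(...)`, so the backend-specific ~14 lines live behind one interface.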
Step 2 — Rename SGLang-prefixed args to generic names

| Old name | New name |
| --- | --- |
| `sglang_server_concurrency` | `rollout_concurrency` |
| `sglang_dp_size` | `rollout_dp_size` |
| `sglang_router_ip` / `sglang_router_port` | `rollout_router_ip` / `rollout_router_port` |
| `sglang_router_policy` | `rollout_router_policy` |
| `sglang_enable_deterministic_inference` | `rollout_deterministic_inference` |
| `vllm_base_url` | `rollout_router_ip:port`, no special case |

Legacy aliases kept in `slime/utils/arguments.py` for one release cycle (coordinated with Phase 3).
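A sketch of how the one-release-cycle aliases could be applied; `apply_legacy_aliases` is a hypothetical helper, and the mapping mirrors the renames above (args are assumed to arrive as an attribute namespace, e.g. from argparse):

```python
import warnings

# Old SGLang-prefixed flag -> new generic flag (from the rename table above).
_LEGACY_ALIASES = {
    "sglang_server_concurrency": "rollout_concurrency",
    "sglang_dp_size": "rollout_dp_size",
    "sglang_router_ip": "rollout_router_ip",
    "sglang_router_port": "rollout_router_port",
    "sglang_router_policy": "rollout_router_policy",
    "sglang_enable_deterministic_inference": "rollout_deterministic_inference",
}


def apply_legacy_aliases(args):
    """Copy deprecated sglang_* values onto the new rollout_* names.

    Explicitly set new-name values win; old names only fill unset fields.
    """
    for old, new in _LEGACY_ALIASES.items():
        old_val = getattr(args, old, None)
        if old_val is not None and getattr(args, new, None) is None:
            warnings.warn(f"{old} is deprecated; use {new}", DeprecationWarning)
            setattr(args, new, old_val)
    return args
```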
Step 3 — No new files
The file stays as `slime/rollout/sglang_rollout.py` during this phase.
Optionally rename to `slime/rollout/rollout.py` in Phase 4 cleanup, since the file is backend-agnostic after the refactor.

5.3 Arguments/Config Refactor
In `slime/utils/arguments.py`, split the argument definitions into shared and backend-specific groups.
Add backend finalizers:
- `finalize_sglang_args(args)`
- `finalize_vllm_args(args)`

Move the SGLang alias fallback out of the shared finalize flow.
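The finalize flow could dispatch on the active backend; a minimal sketch, assuming a hypothetical `args.rollout_backend` field (the finalizer bodies are placeholders):

```python
def finalize_sglang_args(args):
    # SGLang-specific defaults/validation would live here (placeholder).
    args.finalized_by = "sglang"
    return args


def finalize_vllm_args(args):
    # vLLM-specific defaults/validation would live here (placeholder).
    args.finalized_by = "vllm"
    return args


_FINALIZERS = {"sglang": finalize_sglang_args, "vllm": finalize_vllm_args}


def finalize_args(args):
    """Run shared finalization, then exactly one backend finalizer."""
    # ... shared (backend-agnostic) finalize steps go first ...
    backend = getattr(args, "rollout_backend", None)
    if backend not in _FINALIZERS:
        raise ValueError(f"unknown rollout backend: {backend!r}")
    return _FINALIZERS[backend](args)
```

This keeps the SGLang alias fallback inside `finalize_sglang_args` rather than the shared flow.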
5.4 Weight Sync: Decouple FSDP `update_weight_utils.py` from SGLang Internals

Current state analysis
`slime/backends/fsdp_utils/update_weight_utils.py` (287 lines) has two concrete classes:

- `UpdateWeightFromTensor` — calls `engine.update_weights_from_tensor.remote()`; imports `FlattenedTensorBucket`, `MultiprocessingSerializer`, and `monkey_patch_torch_reductions` directly from `sglang.srt.*`
- `UpdateWeightFromDistributed` — calls `engine.update_weights_from_distributed.remote()` and `engine.init_weights_update_group.remote()` (SGLang engine API)

The abstract base `UpdateWeight` itself is clean (only PyTorch + Ray).

The Megatron side already solved this: `slime/backends/megatron_utils/sglang.py` centralizes all SGLang imports into one shim. The FSDP side duplicates these imports inline.
Coupling points
1. `monkey_patch_torch_reductions`, `MultiprocessingSerializer`, and `FlattenedTensorBucket` are imported directly from `sglang.srt.*` with try/except version fallbacks.
2. `UpdateWeightFromTensor.update_bucket_weights()` — uses `FlattenedTensorBucket` to flatten tensors and `MultiprocessingSerializer` to serialize, then calls `engine.update_weights_from_tensor.remote()`.
3. `UpdateWeightFromDistributed.connect_rollout_engines()` — calls `engine.init_weights_update_group.remote()`, which is an SGLang engine method.
4. `UpdateWeightFromDistributed.update_bucket_weights()` — calls `engine.update_weights_from_distributed.remote()`, which is an SGLang engine method.

All four points assume the engine actor exposes SGLang's RPC interface. vLLM engines expose different method names.
Approach
Step 1 — Centralize SGLang imports via a shim (same pattern as Megatron)
Reuse or mirror the existing `slime/backends/megatron_utils/sglang.py` pattern:
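A minimal sketch of what the shim could look like. The `sglang.srt.*` import paths below are illustrative only (the real shim carries the try/except version-fallback chains mentioned above), and `require_sglang` is a hypothetical helper:

```python
# slime/backends/fsdp_utils/sglang_compat.py (sketch)
# Single place that touches sglang.srt.*; everything else imports from here.
try:
    # Illustrative paths -- the real shim keeps version-fallback chains.
    from sglang.srt.utils import MultiprocessingSerializer
    from sglang.srt.patch_torch import monkey_patch_torch_reductions
    from sglang.srt.weight_sync.tensor_bucket import FlattenedTensorBucket

    SGLANG_AVAILABLE = True
except ImportError:
    MultiprocessingSerializer = None
    monkey_patch_torch_reductions = None
    FlattenedTensorBucket = None
    SGLANG_AVAILABLE = False


def require_sglang():
    """Fail fast with a clear message when the SGLang backend is selected
    but sglang is not importable (e.g. a vLLM-only environment)."""
    if not SGLANG_AVAILABLE:
        raise RuntimeError("SGLang backend selected but sglang is not importable")
```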
Then `update_weight_utils.py` imports from `sglang_compat` — one import line instead of five, and only loaded when SGLang is the active backend.

Step 2 — Abstract engine RPC calls behind a protocol
The engine actors (`SGLangEngine`, `VLLMEngine`) already expose weight-sync methods, but with different names/signatures. Introduce a lightweight protocol or adapter.

Step 3 — Lazy-import SGLang utilities only when backend is SGLang
Move the `FlattenedTensorBucket` / `MultiprocessingSerializer` imports inside `UpdateWeightFromTensor.update_bucket_weights()` behind a lazy import, so the module can be loaded in a vLLM-only environment without SGLang installed.

What changes
- New file: `slime/backends/fsdp_utils/sglang_compat.py`
- `slime/backends/fsdp_utils/update_weight_utils.py`: switch to `from .sglang_compat import ...`; add backend-aware engine dispatch helpers

What does NOT change
- `UpdateWeight` abstract base class — already clean.
- `UpdateWeightFromDistributed` NCCL logic — the broadcast itself is pure PyTorch; only the engine RPC dispatch gets a thin wrapper.
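The lightweight protocol from Step 2 could be sketched like this. `WeightSyncEngine` and `VLLMEngineAdapter` are hypothetical names; the three method names come from the RPCs cited above, and per the PR review note, vLLM currently exposes the same names via NcclBridge, so the adapter forwards unchanged for now:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class WeightSyncEngine(Protocol):
    """Interface the UpdateWeight* classes are allowed to assume (hypothetical)."""

    def update_weights_from_tensor(self, **kwargs: Any) -> Any: ...
    def update_weights_from_distributed(self, **kwargs: Any) -> Any: ...
    def init_weights_update_group(self, **kwargs: Any) -> Any: ...


class VLLMEngineAdapter:
    """Wrap a vLLM engine so it satisfies WeightSyncEngine.

    If the vLLM method names diverge later, only this adapter changes;
    today the names match, so it forwards one-to-one.
    """

    def __init__(self, engine: Any):
        self._engine = engine

    def update_weights_from_tensor(self, **kwargs: Any) -> Any:
        return self._engine.update_weights_from_tensor(**kwargs)

    def update_weights_from_distributed(self, **kwargs: Any) -> Any:
        return self._engine.update_weights_from_distributed(**kwargs)

    def init_weights_update_group(self, **kwargs: Any) -> Any:
        return self._engine.init_weights_update_group(**kwargs)
```

`UpdateWeightFromTensor` / `UpdateWeightFromDistributed` then type against `WeightSyncEngine` instead of SGLang's concrete RPC surface, which is exactly the "thin wrapper" the last bullet above refers to.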