forked from THUDM/slime
-
Notifications
You must be signed in to change notification settings - Fork 1
[feat][Draft] Refactor slime to work with custom Megatron and vllm version #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
knlnguyen1802
wants to merge
13
commits into
SamitHuang:dev_vllm
Choose a base branch
from
knlnguyen1802:refactor_dev
base: dev_vllm
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
13 commits
Select commit
Hold shift + click to select a range
c154078
Plan refactor vllm/sglang
knlnguyen1802 1a2dcf5
Code implemented
knlnguyen1802 8addb37
Fix bug
knlnguyen1802 9689a97
Fix bug
knlnguyen1802 3753432
Fix bug
knlnguyen1802 f1e7554
Fix port
knlnguyen1802 91cc780
Fix config
knlnguyen1802 be1ecd4
Fix bug MOE weight sync
knlnguyen1802 e7216d8
Fix bug vllm transfer weight
knlnguyen1802 2343647
Fix weight sync
knlnguyen1802 8498b7b
Fix
knlnguyen1802 8a41184
Fix config
knlnguyen1802 acc9690
Change name config
knlnguyen1802 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,271 @@ | ||
| # RFC: Rollout Separation Plan (EngineGroup Generalization + Executor Cleanup) | ||
|
|
||
| ## 1. Summary | ||
|
|
||
| This RFC proposes backend separation with minimal churn in runtime orchestration. | ||
|
|
||
| Scope is four items: | ||
|
|
||
| 1. Refactor [slime/ray/rollout.py](slime/ray/rollout.py) by generalizing `EngineGroup.start_engines()` and abstracting engine/server creation (no new runtime manager class hierarchy). | ||
| 2. Refactor [slime/rollout/sglang_rollout.py](slime/rollout/sglang_rollout.py) in-place: extract the one SGLang-specific code path (RadixTree) into a strategy hook, rename SGLang-prefixed args to generic names. No class hierarchy, no new files. | ||
| 3. Refactor [slime/utils/arguments.py](slime/utils/arguments.py) into shared args + backend arg groups/finalizers. | ||
| 4. Decouple [slime/backends/fsdp_utils/update_weight_utils.py](slime/backends/fsdp_utils/update_weight_utils.py) from SGLang internals so FSDP weight sync works with both SGLang and vLLM engines. | ||
|
|
||
| ## 2. Already Done (Reuse, Do Not Rewrite) | ||
|
|
||
| - Unified rollout contracts in [slime/rollout/base_types.py](slime/rollout/base_types.py). | ||
| - Backend client abstraction in [slime/rollout/backends/base_client.py](slime/rollout/backends/base_client.py). | ||
| - Backend adapters in [slime/rollout/backends/sglang_client.py](slime/rollout/backends/sglang_client.py) and [slime/rollout/backends/vllm_client.py](slime/rollout/backends/vllm_client.py). | ||
| - Managed vLLM engine actor in [slime/backends/vllm_utils/vllm_engine.py](slime/backends/vllm_utils/vllm_engine.py). | ||
| - vLLM translation sidecar in [slime/backends/vllm_utils/vllm_translation_sidecar.py](slime/backends/vllm_utils/vllm_translation_sidecar.py). | ||
|
|
||
| ## 3. Problem | ||
|
|
||
| - [slime/ray/rollout.py](slime/ray/rollout.py) mixes shared and backend-specific engine creation paths. | ||
| - [slime/rollout/sglang_rollout.py](slime/rollout/sglang_rollout.py) is ~95% backend-agnostic but has one inlined SGLang-specific code path (RadixTree, ~14 lines) and uses SGLang-prefixed arg names for generic rollout concepts. | ||
| - [slime/utils/arguments.py](slime/utils/arguments.py) still has SGLang alias behavior in vLLM path. | ||
| - [slime/backends/fsdp_utils/update_weight_utils.py](slime/backends/fsdp_utils/update_weight_utils.py) hard-imports SGLang internals (`FlattenedTensorBucket`, `MultiprocessingSerializer`, `monkey_patch_torch_reductions`) and calls SGLang-specific engine RPC names (`update_weights_from_tensor`, `update_weights_from_distributed`), making FSDP weight sync unusable with vLLM engines. | ||
|
|
||
| ## 4. Goals and Non-Goals | ||
|
|
||
| ### Goals | ||
|
|
||
| - Keep runtime refactor minimal and localized to `EngineGroup` + creation abstraction. | ||
| - Isolate the one SGLang-specific executor code path behind a strategy hook; keep functions as functions. | ||
| - Rename SGLang-prefixed arg names to generic rollout names to eliminate naming coupling. | ||
| - Reduce backend leakage in argument finalization. | ||
| - Preserve current external behavior, call sites, and import paths. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - No rewrite of `SGLangEngine` or `VLLMEngine` internals. | ||
| - No algorithmic changes to GRPO/PPO. | ||
| - No mandatory feature parity for unsupported backend capabilities. | ||
|
|
||
| ## 5. Design | ||
|
|
||
| ### 5.1 Runtime: Generalize `EngineGroup` (No New Runtime Manager Classes) | ||
|
|
||
| Keep [slime/ray/rollout.py](slime/ray/rollout.py) as the orchestration entry file. | ||
|
|
||
| Refactor focus: | ||
|
|
||
| 1. Generalize `EngineGroup.start_engines()` to call backend-aware creation hooks. | ||
| 2. Abstract engine creation and rollout-server assembly helpers. | ||
| 3. Keep existing startup function API (`start_rollout_servers`) and return shape. | ||
|
|
||
| Proposed helper abstraction points: | ||
|
|
||
| - `create_engine_actor_cls(args, worker_type)` | ||
| - returns `ray.remote(SGLangEngine)` or `ray.remote(VLLMEngine)`. | ||
| - `create_engine_remote(args, actor_cls, scheduling_strategy, ...)` | ||
| - encapsulates `.options(...).remote(...)` with backend-specific init kwargs. | ||
| - `build_rollout_server(...)` | ||
| - standardizes `RolloutServer` construction from engine groups. | ||
|
|
||
| `EngineGroup` and `RolloutServer` remain the main shared dataclasses. | ||
|
|
||
| ### 5.2 Executor: Isolate Backend Logic In-Place (No Class Hierarchy) | ||
|
|
||
| #### Current state analysis | ||
|
|
||
| [slime/rollout/sglang_rollout.py](slime/rollout/sglang_rollout.py) (577 lines) is **already ~95% backend-agnostic**: | ||
|
|
||
| | Function / Class | Lines | Backend-specific? | Notes | | ||
| |---|---|---|---| | ||
| | `_get_backend_client()` | 6 | Factory only | Delegates to existing `RolloutBackendClient` subclasses | | ||
| | `_apply_backend_response()` | 20 | No | Uses `RolloutBackendResponse` contract | | ||
| | `GenerateState` | 37 | **Naming only** | References `sglang_server_concurrency`, `sglang_dp_size`, `sglang_enable_deterministic_inference` — all are generic rollout concepts with SGLang-prefixed names | | ||
| | `generate()` | 58 | **14 lines** | RadixTree middleware path (L170-183) is 100% SGLang-specific; the else branch (L185-195) already uses `RolloutBackendClient` | | ||
| | `generate_and_rm()` | 60 | No | Shared orchestration (semaphore, custom func, reward) | | ||
| | `generate_and_rm_group()` | 37 | No | Group parallelism + deterministic seeds | | ||
| | `abort()` | 38 | No | Already uses `backend.abort()` | | ||
| | `generate_rollout_async()` | 71 | No | Main loop, filtering, metrics | | ||
| | `eval_rollout()` / `eval_rollout_single_dataset()` | 118 | **Naming only** | `sglang_enable_deterministic_inference` reference | | ||
| | `generate_rollout()` | 41 | No | Sync entry point | | ||
|
|
||
| **Conclusion**: the actual backend logic that needs isolation is **one code path** (~14 lines) inside `generate()`. Everything else is either already abstracted through `RolloutBackendClient` or is a naming-only coupling (SGLang-prefixed arg names for generic concepts). | ||
|
|
||
| #### Approach | ||
|
|
||
| 1. **Extract the RadixTree path into a strategy hook** that `generate()` calls conditionally. | ||
| 2. **Rename SGLang-prefixed args** to generic names (coordinated with Phase 3 args refactor). | ||
| 3. **Keep functions as functions** — they compose well and callers (`train.py`, OPD, multi-agent) import them directly. | ||
|
|
||
| #### Concrete changes | ||
|
|
||
| **Step 1 — Extract RadixTree strategy from `generate()`** | ||
|
|
||
| Current `generate()` has an `if use_radix: ... else: backend.generate(...)` branch. | ||
| Refactor into: | ||
|
|
||
| ```python | ||
| # slime/rollout/sglang_rollout.py — generate() simplified | ||
|
|
||
| async def generate(args, sample, sampling_params): | ||
| ... | ||
| input_ids = ... # shared prompt encoding (unchanged) | ||
|
|
||
| strategy = _get_generate_strategy(args) | ||
| resp = await strategy(args, sample, input_ids, sampling_params) | ||
| _apply_backend_response(sample, resp, args) | ||
| return sample | ||
| ``` | ||
|
|
||
| Two strategies: | ||
|
|
||
| ```python | ||
| # Still in sglang_rollout.py (no new file needed) | ||
|
|
||
| def _get_generate_strategy(args): | ||
| """Return the generate coroutine to use.""" | ||
| if _is_radix_tree_enabled(args): | ||
| return _generate_radix_tree # SGLang-only path | ||
| return _generate_via_backend_client # Generic path (SGLang or vLLM) | ||
|
|
||
| def _is_radix_tree_enabled(args) -> bool: | ||
| return ( | ||
| args.use_slime_router | ||
| and "RadixTreeMiddleware" in getattr(args, "slime_router_middleware_paths", []) | ||
| ) | ||
|
|
||
| async def _generate_radix_tree(args, sample, input_ids, sampling_params) -> RolloutBackendResponse: | ||
| """SGLang RadixTree middleware path — returns normalized response.""" | ||
| from slime.router.middleware_hub.radix_tree_middleware import postprocess_sample_with_radix_tree | ||
| url = f"http://{args.rollout_router_ip}:{args.rollout_router_port}/generate" | ||
| payload = { ... } # existing payload construction | ||
| output = await post(url, payload, headers=headers) | ||
| sample = await postprocess_sample_with_radix_tree(args, sample, output) | ||
| return _extract_response_from_sample(sample) # normalize to RolloutBackendResponse | ||
|
|
||
| async def _generate_via_backend_client(args, sample, input_ids, sampling_params) -> RolloutBackendResponse: | ||
| """Generic backend client path — works for SGLang and vLLM.""" | ||
| backend = _get_backend_client(args) | ||
| base_url = f"http://{args.rollout_router_ip}:{args.rollout_router_port}" | ||
| req = RolloutBackendRequest(...) | ||
| return await backend.generate(req, base_url, headers=headers) | ||
| ``` | ||
|
|
||
| **Step 2 — Rename SGLang-prefixed args to generic names** | ||
|
|
||
| | Current name | New name | Reason | | ||
| |---|---|---| | ||
| | `sglang_server_concurrency` | `rollout_concurrency` | Controls request parallelism for any backend | | ||
| | `sglang_dp_size` | `rollout_dp_size` | Data-parallel sharding, not SGLang-specific | | ||
| | `sglang_router_ip` / `sglang_router_port` | `rollout_router_ip` / `rollout_router_port` | Router endpoint, backend-agnostic | | ||
| | `sglang_router_policy` | `rollout_router_policy` | Routing strategy | | ||
| | `sglang_enable_deterministic_inference` | `rollout_deterministic_inference` | Seed-based determinism | | ||
| | `vllm_base_url` | (remove) | Folded into `rollout_router_ip:port`, no special case | | ||
|
|
||
| Legacy aliases kept in [slime/utils/arguments.py](slime/utils/arguments.py) for one release cycle (coordinated with Phase 3). | ||
|
|
||
| **Step 3 — No new files** | ||
|
|
||
| The file stays as [slime/rollout/sglang_rollout.py](slime/rollout/sglang_rollout.py) during this phase. | ||
| Optionally rename to `slime/rollout/rollout.py` in Phase 4 cleanup, since the file is backend-agnostic after the refactor. | ||
|
|
||
| ### 5.3 Arguments/Config Refactor | ||
|
|
||
| In [slime/utils/arguments.py](slime/utils/arguments.py), split into: | ||
|
|
||
| 1. Shared rollout args. | ||
| 2. SGLang backend args/validation. | ||
| 3. vLLM backend args/validation. | ||
|
|
||
| Add backend finalizers: | ||
|
|
||
| - `finalize_sglang_args(args)` | ||
| - `finalize_vllm_args(args)` | ||
|
|
||
| Move SGLang alias fallback out of shared finalize flow. | ||
|
|
||
| ### 5.4 Weight Sync: Decouple FSDP `update_weight_utils.py` from SGLang Internals | ||
|
|
||
| #### Current state analysis | ||
|
|
||
| [slime/backends/fsdp_utils/update_weight_utils.py](slime/backends/fsdp_utils/update_weight_utils.py) (287 lines) has two concrete classes: | ||
|
|
||
| | Class | Weight-push method | SGLang coupling | | ||
| |---|---|---| | ||
| | `UpdateWeightFromTensor` | IPC via Gloo gather → `engine.update_weights_from_tensor.remote()` | Imports `FlattenedTensorBucket`, `MultiprocessingSerializer`, `monkey_patch_torch_reductions` directly from `sglang.srt.*` | | ||
| | `UpdateWeightFromDistributed` | NCCL broadcast → `engine.update_weights_from_distributed.remote()` | Calls `engine.init_weights_update_group.remote()` — SGLang engine API | | ||
|
|
||
| The abstract base `UpdateWeight` itself is clean (only PyTorch + Ray). | ||
|
|
||
| The Megatron side already solved this: [slime/backends/megatron_utils/sglang.py](slime/backends/megatron_utils/sglang.py) centralizes all SGLang imports into one shim. The FSDP side duplicates these imports inline. | ||
|
|
||
| #### Coupling points | ||
|
|
||
| 1. **SGLang utility imports (L13-26)** — `monkey_patch_torch_reductions`, `MultiprocessingSerializer`, `FlattenedTensorBucket` are imported directly from `sglang.srt.*` with try/except version fallbacks. | ||
| 2. **`UpdateWeightFromTensor.update_bucket_weights()`** — uses `FlattenedTensorBucket` to flatten tensors, `MultiprocessingSerializer` to serialize, then calls `engine.update_weights_from_tensor.remote()`. | ||
| 3. **`UpdateWeightFromDistributed.connect_rollout_engines()`** — calls `engine.init_weights_update_group.remote()` which is an SGLang engine method. | ||
| 4. **`UpdateWeightFromDistributed.update_bucket_weights()`** — calls `engine.update_weights_from_distributed.remote()` which is an SGLang engine method. | ||
|
|
||
| All four points assume the engine actor exposes SGLang's RPC interface. vLLM engines expose different method names. | ||
|
|
||
| #### Approach | ||
|
|
||
| **Step 1 — Centralize SGLang imports via a shim (same pattern as Megatron)** | ||
|
|
||
| Reuse or mirror the existing [slime/backends/megatron_utils/sglang.py](slime/backends/megatron_utils/sglang.py) pattern: | ||
|
|
||
| ```python | ||
| # slime/backends/fsdp_utils/sglang_compat.py (new, ~15 lines) | ||
| try: | ||
| from sglang.srt.utils.patch_torch import monkey_patch_torch_reductions | ||
| except ImportError: | ||
| from sglang.srt.patch_torch import monkey_patch_torch_reductions | ||
|
|
||
| from sglang.srt.utils import MultiprocessingSerializer | ||
|
|
||
| try: | ||
| from sglang.srt.weight_sync.tensor_bucket import FlattenedTensorBucket | ||
| except ImportError: | ||
| from sglang.srt.model_executor.model_runner import FlattenedTensorBucket | ||
| ``` | ||
|
|
||
| Then `update_weight_utils.py` imports from `sglang_compat` — one import line instead of five, and only loaded when SGLang is the active backend. | ||
|
|
||
| **Step 2 — Abstract engine RPC calls behind a protocol** | ||
|
|
||
| The engine actors (`SGLangEngine`, `VLLMEngine`) already expose weight-sync methods but with different names/signatures. Introduce a lightweight protocol or adapter: | ||
|
|
||
| ```python | ||
| # In update_weight_utils.py or a small helper | ||
|
|
||
| def _call_engine_update_tensor(engine, backend: str, **kwargs): | ||
| """Dispatch IPC weight update to the correct engine method.""" | ||
| if backend == "vllm": | ||
| return engine.update_weights_from_tensor.remote(**kwargs) # vLLM uses same name via NcclBridge | ||
| return engine.update_weights_from_tensor.remote(**kwargs) # SGLang native | ||
|
|
||
| def _call_engine_update_distributed(engine, backend: str, **kwargs): | ||
| if backend == "vllm": | ||
| return engine.update_weights_from_distributed.remote(**kwargs) | ||
| return engine.update_weights_from_distributed.remote(**kwargs) | ||
|
|
||
| def _call_engine_init_weight_group(engine, backend: str, **kwargs): | ||
| if backend == "vllm": | ||
| return engine.init_weights_update_group.remote(**kwargs) | ||
| return engine.init_weights_update_group.remote(**kwargs) | ||
| ``` | ||
|
|
||
| > Note: Currently `VLLMEngine` already mirrors these method names (it wraps them via `NcclBridge`), so the dispatch functions may initially be identical. The value is making the indirection explicit so future method-name divergence is handled in one place. | ||
|
|
||
| **Step 3 — Lazy-import SGLang utilities only when backend is SGLang** | ||
|
|
||
| Move the `FlattenedTensorBucket` / `MultiprocessingSerializer` imports inside `UpdateWeightFromTensor.update_bucket_weights()` behind a lazy import, so the module can be loaded in a vLLM-only environment without SGLang installed. | ||
|
|
||
| #### What changes | ||
|
|
||
| | File | Change | | ||
| |---|---| | ||
| | `slime/backends/fsdp_utils/sglang_compat.py` | New shim file (~15 lines) centralizing SGLang imports | | ||
| | `slime/backends/fsdp_utils/update_weight_utils.py` | Replace 5 inline SGLang imports with one `from .sglang_compat import ...`; add backend-aware engine dispatch helpers | | ||
|
|
||
| #### What does NOT change | ||
|
|
||
| - `UpdateWeight` abstract base class — already clean. | ||
| - `UpdateWeightFromDistributed` NCCL logic — the broadcast itself is pure PyTorch; only the engine RPC dispatch gets a thin wrapper. | ||
| - Megatron-side weight sync — already has its own shim, not touched. | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move it to design doc