Commit 0255f94

[Blog] SGLang router integration and disaggregated inference roadmap (#3323)
1 parent 120ef05 commit 0255f94

2 files changed

Lines changed: 173 additions & 2 deletions

docs/blog/posts/sglang-router.md

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
---
title: "SGLang router integration and disaggregated inference roadmap"
date: 2025-11-25
description: "TBA"
slug: sglang-router
image: https://dstack.ai/static-assets/static-assets/images/dstack-sglang-router.png
categories:
  - Changelog
---

# SGLang router integration and disaggregated inference roadmap

[dstack](https://github.com/dstackai/dstack/) provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, `dstack` bridges diverse hardware and open-source tooling.

<img src="https://dstack.ai/static-assets/static-assets/images/dstack-sglang-router.png" width="630"/>

As disaggregated, low-latency inference emerges, we aim to ensure this new stack runs natively on `dstack`. To move this forward, we’re introducing native integration between `dstack` and [SGLang’s Model Gateway](https://docs.sglang.ai/advanced_features/router.html) (formerly known as the SGLang Router).

<!-- more -->

Although `dstack` can run on Kubernetes, it differs by offering higher-level abstractions that cover the core AI use cases: [dev environments](../../docs/concepts/dev-environments.md) for development, [tasks](../../docs/concepts/tasks.md) for training, and [services](../../docs/concepts/services.md) for inference.

## Services

Here’s an example of a service:

=== "NVIDIA"

    <div editor-title="qwen.dstack.yml">

    ```yaml
    type: service
    name: qwen

    image: lmsysorg/sglang:latest
    env:
      - HF_TOKEN
      - MODEL_ID=qwen/qwen2.5-0.5b-instruct
    commands:
      - |
        python3 -m sglang.launch_server \
          --model-path $MODEL_ID \
          --port 8000 \
          --trust-remote-code
    port: 8000
    model: qwen/qwen2.5-0.5b-instruct

    resources:
      gpu: 8GB..24GB:1
    ```

    </div>

=== "AMD"

    <div editor-title="qwen.dstack.yml">

    ```yaml
    type: service
    name: qwen

    image: lmsysorg/sglang:v0.5.5.post3-rocm700-mi30x
    env:
      - HF_TOKEN
      - MODEL_ID=qwen/qwen2.5-0.5b-instruct
    commands:
      - |
        python3 -m sglang.launch_server \
          --model-path $MODEL_ID \
          --port 8000 \
          --trust-remote-code
    port: 8000
    model: qwen/qwen2.5-0.5b-instruct

    resources:
      gpu: MI300X:1
    ```

    </div>

This service can be deployed via the following command:

<div class="termy">

```shell
$ HF_TOKEN=...
$ dstack apply -f qwen.dstack.yml
```

</div>

This deploys the service as an OpenAI-compatible endpoint and manages provisioning and replicas automatically.

## Gateways

If you'd like to enable auto-scaling, HTTPS, or use a custom domain, create a gateway:

<div editor-title="gateway.dstack.yml">

```yaml
type: gateway
name: my-gateway

backend: aws
region: eu-west-1

# Specify your custom domain
domain: example.com
```

</div>

This gateway can be created via the following command:

<div class="termy">

```shell
$ dstack apply -f gateway.dstack.yml
```

</div>

Once the gateway has a hostname, update your domain’s DNS settings by adding a record for `*.<gateway domain>`.

After that, if you configure [replicas and scaling](../../docs/concepts/services.md#replicas-and-scaling), the gateway will automatically scale the number of replicas and route traffic across them.

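
As a minimal sketch of what configuring replicas and scaling might look like, the service from the previous section can be extended with the `replicas` and `scaling` properties described in the services documentation. The replica range and the target value below are illustrative assumptions, not recommendations; tune them to your own traffic.

```yaml
type: service
name: qwen

image: lmsysorg/sglang:latest
env:
  - HF_TOKEN
  - MODEL_ID=qwen/qwen2.5-0.5b-instruct
commands:
  - |
    python3 -m sglang.launch_server \
      --model-path $MODEL_ID \
      --port 8000 \
      --trust-remote-code
port: 8000
model: qwen/qwen2.5-0.5b-instruct

# Illustrative range: keep at least 1 replica, allow up to 4
replicas: 1..4
scaling:
  # Scale on requests per second; the target value is an assumption
  metric: rps
  target: 10

resources:
  gpu: 8GB..24GB:1
```

Re-running `dstack apply -f qwen.dstack.yml` applies the updated configuration.
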
### Router

By default, the gateway uses its built-in load balancer to route traffic across replicas. With the latest release, you can instead delegate traffic routing to the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html) by setting the `router` property to `sglang`:

<div editor-title="gateway.dstack.yml">

```yaml
type: gateway
name: my-gateway

backend: aws
region: eu-west-1

# Specify your custom domain
domain: example.com

router:
  type: sglang
  policy: cache_aware
```

</div>

The `policy` property allows you to configure the routing policy:

* `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue.
* `power_of_two` &mdash; Samples two workers and picks the lighter one.
* `random` &mdash; Uniform random selection.
* `round_robin` &mdash; Cycles through workers in order.

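
Switching policies should only require changing the `policy` value. As a sketch, here is the same gateway using `power_of_two` instead of the default; the gateway name, backend, region, and domain are the placeholder values from the example above.

```yaml
type: gateway
name: my-gateway

backend: aws
region: eu-west-1

# Specify your custom domain
domain: example.com

router:
  type: sglang
  # One of: cache_aware (default), power_of_two, random, round_robin
  policy: power_of_two
```
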
With this integration, the gateway can take K/V cache reuse into account when routing across replicas, a key step toward low-latency inference. It also paves the way for full disaggregated inference and native auto-scaling. More fundamentally, it reflects our commitment to collaborating with the open-source ecosystem instead of reinventing its core components.

## Limitations and roadmap

Looking ahead, this integration also shapes our roadmap. Over the coming releases, we plan to expand support in several key areas:

* Enabling prefill and decode worker separation for full disaggregation (today, only standard workers are supported).
* Introducing auto-scaling based on TTFT (Time to First Token) and ITL (Inter-Token Latency), complementing the current requests-per-second scaling metric.
* Extending native support to more emerging inference stacks.

## What's next?

1. Check [dev environments](../../docs/concepts/dev-environments.md),
    [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
    and [gateways](../../docs/concepts/gateways.md)
2. Follow [Quickstart](../../docs/quickstart.md)
3. Join [Discord](https://discord.gg/u8SmfwPpMd)

docs/docs/concepts/gateways.md

Lines changed: 1 addition & 2 deletions
@@ -85,8 +85,7 @@ router:
 </div>
 
 !!! info "Policy"
-
-    The `router` property allows you to configure the routing `policy`:
+    The `policy` property allows you to configure the routing policy:
 
     * `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue.
     * `power_of_two` &mdash; Samples two workers and picks the lighter one.
