Commit f4bc185

[Docs] Minor changes related to PD disaggregation
1 parent 651c8d0 commit f4bc185

3 files changed

Lines changed: 46 additions & 27 deletions

docs/docs/concepts/gateways.md

Lines changed: 5 additions & 4 deletions
@@ -110,7 +110,11 @@ router:
 
 </div>
 
-!!! info "Policy"
+If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
+
+> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
+
+??? info "Policy"
     The `policy` property allows you to configure the routing policy:
 
     * `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue.
@@ -119,9 +123,6 @@ router:
     * `round_robin` &mdash; Cycles through workers in order.
 
 
-> Services using this type of gateway can run PD-disaggregated inference. To run PD disaggregation inference, refer to the [SGLang PD-Disaggregation](../../examples/inference/sglang/index.md#pd-disaggregation) example.
->
-> Support for auto-scaling based on TTFT and ITL is coming soon.
 
 ### Public IP
 
docs/docs/concepts/services.md

Lines changed: 5 additions & 2 deletions
@@ -182,6 +182,8 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 > The `scaling` property requires creating a [gateway](gateways.md).
 
+<span id="replica-groups"></span>
+
 ??? info "Replica groups"
     A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules.
 
@@ -230,8 +232,9 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 > Properties such as `regions`, `port`, `image`, `env` and some other cannot be configured per replica group. This support is coming soon.
 
-??? info "Disaggregated serving"
-    Replica groups support disaggregated prefill and decode, allowing both worker types to run within a single service. To run PD disaggregated inference, refer to the [SGLang PD-Disaggregation](../../examples/inference/sglang/index.md#pd-disaggregation) example.
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
 
 ### Authorization
 
examples/inference/sglang/README.md

Lines changed: 36 additions & 21 deletions
@@ -9,7 +9,7 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGL
 
 ## Apply a configuration
 
-Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SgLang.
+Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
 
 === "NVIDIA"
 
@@ -108,31 +108,18 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
 ```
 </div>
 
-!!! info "SGLang Model Gateway"
-    If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+!!! info "Router policy"
+    If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
 
-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
 
-## PD-Disaggregation
+## Configuration options
 
-To run PD-Disaggregated inference using SGLang Model Gateway.
+### PD disaggregation
 
-Create a SGLang-enabled gateway in the same network where prefill and decode workers will be deployed. Here we are using a Kubernetes cluster to ensure the gateway and workers share the same network.
+If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
 
-```yaml
-type: gateway
-name: gateway-name
-
-backend: kubernetes
-region: any
-
-# This domain will be used to access the endpoint
-domain: example.com
-router:
-  type: sglang
-```
-
-After the gateway is ready, create a node group with at least two instances—one for the Prefill worker and one for the Decode worker—within the same Kubernetes cluster where the gateway is running. Then apply below service configuration to the GPU nodes.
+<div editor-title="examples/inference/sglang/pd.dstack.yml">
 
 ```yaml
 type: service
@@ -189,6 +176,34 @@ router:
   pd_disaggregation: true
 ```
 
+</div>
+
+Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
+
+#### Gateway
+
+Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
+
+For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
+
+<div editor-title="gateway.dstack.yml">
+
+```yaml
+type: gateway
+name: gateway-name
+
+backend: kubernetes
+region: any
+
+domain: example.com
+router:
+  type: sglang
+```
+
+</div>
+
+<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
+
 ## Source code
 
 The source-code of these examples can be found in
