docs/docs/concepts/gateways.md: 5 additions & 4 deletions
@@ -110,7 +110,11 @@ router:
 
 </div>
 
-!!! info "Policy"
+If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
+
+> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
+
+??? info "Policy"
 The `policy` property allows you to configure the routing policy:
 
 * `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
@@ -119,9 +123,6 @@ router:
 * `round_robin` — Cycles through workers in order.
 
 
-> Services using this type of gateway can run PD-disaggregated inference. To run PD disaggregation inference, refer to the [SGLang PD-Disaggregation](../../examples/inference/sglang/index.md#pd-disaggregation) example.
->
-> Support for auto-scaling based on TTFT and ITL is coming soon.
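
For context on the `Policy` admonition in this file, here is a minimal sketch of a gateway configuration that selects a routing policy. It assumes `policy` is set under `router` next to `type`; the hunk header context suggests this placement, but the diff does not show it directly. The name, backend, region, and domain are the placeholder values from the gateway example that this change removes from the SGLang example page.

```yaml
type: gateway
name: gateway-name

backend: kubernetes
region: any

# This domain will be used to access the endpoint
domain: example.com

router:
  type: sglang
  # Assumed placement of `policy`; `cache_aware` is the documented default,
  # `round_robin` cycles through workers in order
  policy: round_robin
```

Per the note added in this change, a gateway used for PD disaggregation must currently run in the same cluster as the service.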
docs/docs/concepts/services.md: 5 additions & 2 deletions
@@ -182,6 +182,8 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 > The `scaling` property requires creating a [gateway](gateways.md).
 
+<span id="replica-groups"></span>
+
 ??? info "Replica groups"
 A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules.
 
@@ -230,8 +232,9 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 
 > Properties such as `regions`, `port`, `image`, `env` and some other cannot be configured per replica group. This support is coming soon.
 
-??? info "Disaggregated serving"
-Replica groups support disaggregated prefill and decode, allowing both worker types to run within a single service. To run PD disaggregated inference, refer to the [SGLang PD-Disaggregation](../../examples/inference/sglang/index.md#pd-disaggregation) example.
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).

-If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+!!! info "Router policy"
+If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
 
-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, ratelimits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
 
-## PD-Disaggregation
+## Configuration options
 
-To run PD-Disaggregated inference using SGLang Model Gateway.
+### PD disaggregation
 
-Create a SGLang-enabled gateway in the same network where prefill and decode workers will be deployed. Here we are using a Kubernetes cluster to ensure the gateway and workers share the same network.
+If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
 
-```yaml
-type: gateway
-name: gateway-name
-
-backend: kubernetes
-region: any
-
-# This domain will be used to access the endpoint
-domain: example.com
-router:
-  type: sglang
-```
-
-After the gateway is ready, create a node group with at least two instances—one for the Prefill worker and one for the Decode worker—within the same Kubernetes cluster where the gateway is running. Then apply below service configuration to the GPU nodes.