diff --git a/docs/docs/concepts/gateways.md b/docs/docs/concepts/gateways.md
index 1435926810..03ddd10e5c 100644
--- a/docs/docs/concepts/gateways.md
+++ b/docs/docs/concepts/gateways.md
@@ -1,10 +1,9 @@
# Gateways
-Gateways manage the ingress traffic of running [services](services.md),
-provide an HTTPS endpoint mapped to your domain, handle auto-scaling and rate limits.
+Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}.
-> If you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
-> the gateway is already set up for you.
+
## Apply a configuration
@@ -57,6 +56,48 @@ You can create gateways with the `aws`, `azure`, `gcp`, or `kubernetes` backends
Gateways in `kubernetes` backend require an external load balancer. Managed Kubernetes solutions usually include a load balancer.
For self-hosted Kubernetes, you must provide a load balancer by yourself.
+### Router
+
+By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
+
+#### SGLang
+
+The `sglang` router delegates routing logic to the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}.
+
+To enable it, set the `type` field under `router` to `sglang`:
+
+
+
+```yaml
+type: gateway
+name: sglang-gateway
+
+backend: aws
+region: eu-west-1
+
+domain: example.com
+
+router:
+ type: sglang
+ policy: cache_aware
+```
+
+
+
+!!! info "Policy"
+
+ The `router` property allows you to configure the routing `policy`:
+
+ * `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
+ * `power_of_two` — Samples two workers and picks the lighter one.
+ * `random` — Uniform random selection.
+ * `round_robin` — Cycles through workers in order.
+
+
+> Currently, services using this type of gateway must run standard SGLang workers. See the [example](../../examples/inference/sglang/index.md).
+>
+> Support for prefill/decode disaggregation and auto-scaling based on inter-token latency is coming soon.
+
### Public IP
If you don't need a public IP for the gateway, set `public_ip` to `false` (the default is `true`) to make the gateway private.
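
For example, a private gateway configuration might look like this (a minimal sketch; the backend, region, and domain values are illustrative):

```yaml
type: gateway
name: private-gateway

backend: aws
region: eu-west-1

domain: internal.example.com

# Keep the gateway off the public internet
public_ip: false
```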
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
index 6404c2bd1a..09ff1fba8f 100644
--- a/docs/docs/concepts/services.md
+++ b/docs/docs/concepts/services.md
@@ -100,12 +100,13 @@ If [authorization](#authorization) is not disabled, the service endpoint require
However, you'll need a gateway in the following cases:
* To use auto-scaling or rate limits
+ * To use a custom router, such as the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}
* To enable HTTPS for the endpoint and map it to your domain
* If your service requires WebSockets
* If your service cannot work with a [path prefix](#path-prefix)
- Note, if you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
- a gateway is already pre-configured for you.
+
If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
`https://<run name>.<gateway domain>/`.
diff --git a/docs/docs/reference/dstack.yml/gateway.md b/docs/docs/reference/dstack.yml/gateway.md
index 4d81d5d508..b8e2742891 100644
--- a/docs/docs/reference/dstack.yml/gateway.md
+++ b/docs/docs/reference/dstack.yml/gateway.md
@@ -10,6 +10,16 @@ The `gateway` configuration type allows creating and updating [gateways](../../c
type:
required: true
+### `router`
+
+=== "SGLang Model Gateway"
+
+ #SCHEMA# dstack._internal.core.models.routers.SGLangRouterConfig
+ overrides:
+ show_root_heading: false
+ type:
+ required: true
+
### `certificate`
=== "Let's encrypt"
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index f880ac30b7..1652b838c8 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -2,32 +2,21 @@
This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} and `dstack`.
-??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
-
-
-
- ```shell
- $ git clone https://github.com/dstackai/dstack
- $ cd dstack
- ```
-
-
+## Apply a configuration
-## Deployment
Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
-=== "AMD"
+=== "NVIDIA"
-
+
```yaml
type: service
- name: deepseek-r1-amd
+ name: deepseek-r1-nvidia
- image: lmsysorg/sglang:v0.4.1.post4-rocm620
+ image: lmsysorg/sglang:latest
env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+ - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
commands:
- python3 -m sglang.launch_server
@@ -36,25 +25,24 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
--trust-remote-code
port: 8000
- model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+ model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
resources:
- gpu: MI300x
- disk: 300GB
+ gpu: 24GB
```
-=== "NVIDIA"
+=== "AMD"
-
+
```yaml
type: service
- name: deepseek-r1-nvidia
+ name: deepseek-r1-amd
- image: lmsysorg/sglang:latest
+ image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
- - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+ - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
commands:
- python3 -m sglang.launch_server
@@ -63,16 +51,14 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
--trust-remote-code
port: 8000
- model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+ model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
resources:
- gpu: 24GB
+ gpu: MI300x
+ disk: 300GB
```
-
-### Applying the configuration
-
To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
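
For example (the configuration filename is illustrative):

```shell
$ dstack apply -f examples/inference/sglang/.dstack.yml
```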
@@ -118,8 +104,10 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+!!! info "SGLang Model Gateway"
+ If you'd like to use a custom routing policy, e.g. via the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}, create a gateway with `router` set to `sglang`. See [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, or rate limits), the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
## Source code
@@ -128,5 +116,5 @@ The source-code of this example can be found in
## What's next?
-1. Check [services](https://dstack.ai/docs/services)
+1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways)
2. Browse the [SGLang DeepSeek Usage](https://docs.sglang.ai/references/deepseek.html) and [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)
diff --git a/src/dstack/_internal/core/models/routers.py b/src/dstack/_internal/core/models/routers.py
index ec779b1242..e07631e12e 100644
--- a/src/dstack/_internal/core/models/routers.py
+++ b/src/dstack/_internal/core/models/routers.py
@@ -1,6 +1,9 @@
from enum import Enum
from typing import Literal
+from pydantic import Field
+from typing_extensions import Annotated
+
from dstack._internal.core.models.common import CoreModel
@@ -9,8 +12,13 @@ class RouterType(str, Enum):
class SGLangRouterConfig(CoreModel):
- type: Literal["sglang"] = "sglang"
- policy: Literal["random", "round_robin", "cache_aware", "power_of_two"] = "cache_aware"
+ type: Annotated[Literal["sglang"], Field(description="The router type")] = "sglang"
+ policy: Annotated[
+ Literal["random", "round_robin", "cache_aware", "power_of_two"],
+ Field(
+ description="The routing policy. Options: `random`, `round_robin`, `cache_aware`, `power_of_two`"
+ ),
+ ] = "cache_aware"
AnyRouterConfig = SGLangRouterConfig
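
As a standalone illustration, the validation behavior of `SGLangRouterConfig` can be sketched with plain pydantic (`CoreModel` is dstack-internal, so `BaseModel` is used here as a stand-in; the invalid policy name is hypothetical):

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class SGLangRouterConfig(BaseModel):
    # Stand-in for the CoreModel-based config above
    type: Literal["sglang"] = "sglang"
    policy: Literal["random", "round_robin", "cache_aware", "power_of_two"] = "cache_aware"


# Defaults apply when no fields are given
cfg = SGLangRouterConfig()
print(cfg.policy)  # cache_aware

# An unsupported policy name is rejected at validation time
try:
    SGLangRouterConfig(policy="least_loaded")
except ValidationError:
    print("invalid policy rejected")
```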