diff --git a/docs/docs/concepts/gateways.md b/docs/docs/concepts/gateways.md
index 1435926810..03ddd10e5c 100644
--- a/docs/docs/concepts/gateways.md
+++ b/docs/docs/concepts/gateways.md
@@ -1,10 +1,9 @@
 # Gateways

-Gateways manage the ingress traffic of running [services](services.md),
-provide an HTTPS endpoint mapped to your domain, handle auto-scaling and rate limits.
+Gateways manage ingress traffic for running [services](services.md), handle auto-scaling and rate limits, enable HTTPS, and allow you to configure a custom domain. They also support custom routers, such as the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}.

-> If you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
-> the gateway is already set up for you.
+

 ## Apply a configuration

@@ -57,6 +56,48 @@ You can create gateways with the `aws`, `azure`, `gcp`, or `kubernetes` backends
 Gateways in `kubernetes` backend require an external load balancer. Managed Kubernetes solutions usually include a load balancer. For self-hosted Kubernetes, you must provide a load balancer by yourself.

+### Router
+
+By default, the gateway uses its own load balancer to route traffic between replicas. However, you can delegate this responsibility to a specific router by setting the `router` property. Currently, the only supported external router is `sglang`.
+
+#### SGLang
+
+The `sglang` router delegates routing logic to the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}.
+
+To enable it, set the `type` field under `router` to `sglang`:
+
+
+
+```yaml
+type: gateway
+name: sglang-gateway
+
+backend: aws
+region: eu-west-1
+
+domain: example.com
+
+router:
+  type: sglang
+  policy: cache_aware
+```
+
+
+
+!!! info "Policy"
+
+    The `router` property allows you to configure the routing `policy`:
+
+    * `cache_aware`: Default policy; combines cache locality with load balancing, falling back to shortest queue.
+    * `power_of_two`: Samples two workers and picks the lighter one.
+    * `random`: Uniform random selection.
+    * `round_robin`: Cycles through workers in order.
+
+
+> Currently, services using this type of gateway must run standard SGLang workers. See the [example](../../examples/inference/sglang/index.md).
+>
+> Support for prefill/decode disaggregation and auto-scaling based on inter-token latency is coming soon.
+
 ### Public IP

 If you don't need/want a public IP for the gateway, you can set the `public_ip` to `false` (the default value is `true`), making the gateway private.
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
index 6404c2bd1a..09ff1fba8f 100644
--- a/docs/docs/concepts/services.md
+++ b/docs/docs/concepts/services.md
@@ -100,12 +100,13 @@ If [authorization](#authorization) is not disabled, the service endpoint require
 However, you'll need a gateway in the following cases:

  * To use auto-scaling or rate limits
+ * To enable a custom router, such as the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}
  * To enable HTTPS for the endpoint and map it to your domain
  * If your service requires WebSockets
  * If your service cannot work with a [path prefix](#path-prefix)

- Note, if you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
- a gateway is already pre-configured for you.
+

 If a [gateway](gateways.md) is configured, the service endpoint will be accessible at `https://./`.
diff --git a/docs/docs/reference/dstack.yml/gateway.md b/docs/docs/reference/dstack.yml/gateway.md
index 4d81d5d508..b8e2742891 100644
--- a/docs/docs/reference/dstack.yml/gateway.md
+++ b/docs/docs/reference/dstack.yml/gateway.md
@@ -10,6 +10,16 @@ The `gateway` configuration type allows creating and updating [gateways](../../c
       type:
         required: true

+### `router`
+
+=== "SGLang Model Gateway"
+
+    #SCHEMA# dstack._internal.core.models.routers.SGLangRouterConfig
+        overrides:
+            show_root_heading: false
+            type:
+                required: true
+
 ### `certificate`

 === "Let's encrypt"
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index f880ac30b7..1652b838c8 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -2,32 +2,21 @@

 This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} and `dstack`.

-??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
-
-
-
-    ```shell
-    $ git clone https://github.com/dstackai/dstack
-    $ cd dstack
-    ```
-
-
+## Apply a configuration

-## Deployment
 Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SgLang.

-=== "AMD"
+=== "NVIDIA"

-
+

     ```yaml
     type: service
-    name: deepseek-r1-amd
+    name: deepseek-r1-nvidia

-    image: lmsysorg/sglang:v0.4.1.post4-rocm620
+    image: lmsysorg/sglang:latest
     env:
-      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B

     commands:
       - python3 -m sglang.launch_server
@@ -36,25 +25,24 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
          --trust-remote-code

     port: 8000
-    model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
+    model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B

     resources:
-      gpu: MI300x
-      disk: 300GB
+      gpu: 24GB
     ```

-=== "NVIDIA"
+=== "AMD"

-
+

     ```yaml
     type: service
-    name: deepseek-r1-nvidia
+    name: deepseek-r1-amd

-    image: lmsysorg/sglang:latest
+    image: lmsysorg/sglang:v0.4.1.post4-rocm620
     env:
-      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+      - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B

     commands:
       - python3 -m sglang.launch_server
@@ -63,16 +51,14 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
          --trust-remote-code

     port: 8000
-    model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+    model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

     resources:
-      gpu: 24GB
+      gpu: MI300x
+      disk: 300GB
     ```
-
-### Applying the configuration
-
 To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
@@ -118,8 +104,10 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ ```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway./`.
+!!! info "SGLang Model Gateway"
+    If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/router.html#){:target="_blank"}, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, or rate limits), the OpenAI-compatible endpoint is available at `https://gateway./`.

 ## Source code

@@ -128,5 +116,5 @@ The source-code of this example can be found in
 ## What's next?

-1. Check [services](https://dstack.ai/docs/services)
+1. Read about [services](https://dstack.ai/docs/concepts/services) and [gateways](https://dstack.ai/docs/concepts/gateways)
 2. Browse the [SgLang DeepSeek Usage](https://docs.sglang.ai/references/deepseek.html), [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)
diff --git a/src/dstack/_internal/core/models/routers.py b/src/dstack/_internal/core/models/routers.py
index ec779b1242..e07631e12e 100644
--- a/src/dstack/_internal/core/models/routers.py
+++ b/src/dstack/_internal/core/models/routers.py
@@ -1,6 +1,9 @@
 from enum import Enum
 from typing import Literal

+from pydantic import Field
+from typing_extensions import Annotated
+
 from dstack._internal.core.models.common import CoreModel


@@ -9,8 +12,13 @@ class RouterType(str, Enum):


 class SGLangRouterConfig(CoreModel):
-    type: Literal["sglang"] = "sglang"
-    policy: Literal["random", "round_robin", "cache_aware", "power_of_two"] = "cache_aware"
+    type: Annotated[Literal["sglang"], Field(description="The router type")] = "sglang"
+    policy: Annotated[
+        Literal["random", "round_robin", "cache_aware", "power_of_two"],
+        Field(
+            description="The routing policy. Options: `random`, `round_robin`, `cache_aware`, `power_of_two`"
+        ),
+    ] = "cache_aware"


 AnyRouterConfig = SGLangRouterConfig
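
Reviewer note: the one-line policy descriptions added to `gateways.md` may be easier to compare with a small illustration. The following is a minimal stdlib-only sketch of `round_robin`, `random`, and `power_of_two`, assuming a hypothetical map from worker name to in-flight request count. It is not the SGLang Model Gateway's or `dstack`'s implementation; every function name and the load data are invented for illustration, and `cache_aware` is left out because it depends on router-internal cache state.

```python
import itertools
import random
from typing import Callable, Dict, List

# Hypothetical in-flight request counts per worker (illustrative data only).
Loads = Dict[str, int]


def make_round_robin(workers: List[str]) -> Callable[[Loads], str]:
    """`round_robin`: cycle through workers in order, ignoring load."""
    cycle = itertools.cycle(workers)
    return lambda loads: next(cycle)


def pick_random(loads: Loads, rng: random.Random) -> str:
    """`random`: uniform random selection among all workers."""
    return rng.choice(sorted(loads))


def pick_power_of_two(loads: Loads, rng: random.Random) -> str:
    """`power_of_two`: sample two distinct workers, keep the less loaded one."""
    first, second = rng.sample(sorted(loads), 2)
    return first if loads[first] <= loads[second] else second


if __name__ == "__main__":
    loads = {"worker-a": 5, "worker-b": 1, "worker-c": 9}
    rng = random.Random(0)
    round_robin = make_round_robin(sorted(loads))
    print([round_robin(loads) for _ in range(4)])
    print(pick_random(loads, rng))
    print(pick_power_of_two(loads, rng))
```

In this sketch the most loaded worker can never win a `power_of_two` draw, since it loses against either other worker it is paired with, while `random` and `round_robin` keep sending it traffic regardless of load.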