From 347daf4b2d413ab9d2b20dad5a5c646ece6a81eb Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Tue, 25 Nov 2025 14:37:35 +0100
Subject: [PATCH 1/2] [Blog] SGLang router integration and disaggregated inference roadmap

---
 docs/blog/posts/sglang-router.md | 172 +++++++++++++++++++++++++++++++
 docs/docs/concepts/gateways.md   |   3 +-
 2 files changed, 173 insertions(+), 2 deletions(-)
 create mode 100644 docs/blog/posts/sglang-router.md

diff --git a/docs/blog/posts/sglang-router.md b/docs/blog/posts/sglang-router.md
new file mode 100644
index 0000000000..e0bf2861ba
--- /dev/null
+++ b/docs/blog/posts/sglang-router.md
@@ -0,0 +1,172 @@
---
title: "SGLang router integration and disaggregated inference roadmap"
date: 2025-11-25
description: "TBA"
slug: sglang-router
image: https://dstack.ai/static-assets/static-assets/images/dstack-sglang-router.png
categories:
  - Changelog
---

# SGLang router integration and disaggregated inference roadmap

[dstack](https://github.com/dstackai/dstack/) provides a streamlined way to handle GPU provisioning and workload orchestration across GPU clouds, Kubernetes clusters, or on-prem environments. Built for interoperability, `dstack` bridges diverse hardware and open-source tooling.

As disaggregated, low-latency inference stacks emerge, we aim to ensure this new stack runs natively on `dstack`. To move this forward, we’re introducing a native integration between `dstack` and [SGLang’s Model Gateway](https://docs.sglang.ai/advanced_features/router.html) (formerly known as the SGLang Router).

Although `dstack` can run on Kubernetes, it differs by offering higher-level abstractions that cover the core AI use cases: [dev environments](../../docs/concepts/dev-environments.md) for development, [tasks](../../docs/concepts/tasks.md) for training, and [services](../../docs/concepts/services.md) for inference.

## Services

Here’s an example of a service:

=== "NVIDIA"
+ + ```yaml + type: service + name: qwen + + image: lmsysorg/sglang:latest + env: + - HF_TOKEN + - MODEL_ID=qwen/qwen2.5-0.5b-instruct + commands: + - | + python3 -m sglang.launch_server \ + --model-path $MODEL_ID \ + --port 8000 \ + --trust-remote-code + port: 8000 + model: qwen/qwen2.5-0.5b-instruct + + resources: + gpu: 8GB..24GB:1 + ``` + +
+ +=== "AMD" +
+ + ```yaml + type: service + name: qwen + + image: lmsysorg/sglang:v0.5.5.post3-rocm700-mi30x + env: + - HF_TOKEN + - MODEL_ID=qwen/qwen2.5-0.5b-instruct + commands: + - | + python3 -m sglang.launch_server \ + --model-path $MODEL_ID \ + --port 8000 \ + --trust-remote-code + port: 8000 + model: qwen/qwen2.5-0.5b-instruct + + resources: + gpu: MI300X:1 + ``` + +
+ +This service can be deployed via the following command: + +
```shell
$ export HF_TOKEN=...
$ dstack apply -f qwen.dstack.yml
```
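Once the run is up, the service can be called like any OpenAI-compatible endpoint. Below is a minimal sketch using only Python's standard library; the `ENDPOINT` URL is a placeholder — the actual URL depends on your `dstack` server (or, later, your gateway domain):

```python
import json
import urllib.request

# Placeholder URL — substitute the endpoint shown by `dstack apply`
# (or your gateway domain once one is configured).
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def chat_body(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(chat_body("qwen/qwen2.5-0.5b-instruct", prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```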
This deploys the service and exposes it via an OpenAI-compatible endpoint, handling provisioning and replica management automatically.

## Gateways

If you'd like to enable auto-scaling, HTTPS, or use a custom domain, create a gateway:
+ + ```yaml + type: gateway + name: my-gateway + + backend: aws + region: eu-west-1 + + # Specify your custom domain + domain: example.com + ``` + +
+ +This gateway can be created via the following command: + +
+ +```shell +$ dstack apply -f gateway.dstack.yml +``` + +
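With a gateway in place, a service can opt into auto-scaling by declaring how many replicas it may run and which metric to scale on. A minimal sketch, following the replicas-and-scaling section of the services docs (the values here are illustrative):

```yaml
type: service
name: qwen

# image, env, commands, port, and model as in the example above

replicas: 1..4
scaling:
  # Scale between 1 and 4 replicas based on requests per second
  metric: rps
  target: 10
```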
Once the gateway has a hostname, update your domain’s DNS settings by adding a record that points `*.<your domain>` (e.g. `*.example.com` for the domain configured above) to the gateway’s hostname.

After that, if you configure [replicas and scaling](../../docs/concepts/services.md#replicas-and-scaling), the gateway will automatically scale the number of replicas and route traffic across them.

### Router

By default, the gateway uses its built-in load balancer to route traffic across replicas. With the latest release, you can instead delegate traffic routing to the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html) by setting the `router` property to `sglang`:
+ + ```yaml + type: gateway + name: my-gateway + + backend: aws + region: eu-west-1 + + # Specify your custom domain + domain: example.com + + router: + type: sglang + policy: cache_aware + ``` + +
The `policy` property allows you to configure the routing policy:

* `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
* `power_of_two` — Samples two workers and picks the lighter one.
* `random` — Uniform random selection.
* `round_robin` — Cycles through workers in order.

With this integration, K/V cache reuse across replicas becomes possible — a key step toward low-latency inference. It also paves the way for full disaggregated inference and native auto-scaling. And fundamentally, it reflects our commitment to collaborating with the open-source ecosystem instead of reinventing its core components.

## Limitations and roadmap

Looking ahead, this integration also shapes our roadmap. Over the coming releases, we plan to expand support in several key areas:

* Enabling prefill and decode worker separation for full disaggregation (today, only standard workers are supported).
* Introducing auto-scaling based on inter-token latency, rather than relying solely on request-per-second metrics.
* Extending native support to more emerging inference stacks.

## What's next?

1. Check [dev environments](../../docs/concepts/dev-environments.md),
   [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
   and [gateways](../../docs/concepts/gateways.md)
2. Follow [Quickstart](../../docs/quickstart.md)
3. Join [Discord](https://discord.gg/u8SmfwPpMd)

diff --git a/docs/docs/concepts/gateways.md b/docs/docs/concepts/gateways.md
index eb433f7d36..728077addb 100644
--- a/docs/docs/concepts/gateways.md
+++ b/docs/docs/concepts/gateways.md
@@ -85,8 +85,7 @@ router:

!!! info "Policy"

-
-    The `router` property allows you to configure the routing `policy`:
+    The `policy` property allows you to configure the routing policy:

* `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
* `power_of_two` — Samples two workers and picks the lighter one.

From 29df5209bf9806436e5f751790e54f4230de54d1 Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Tue, 25 Nov 2025 15:43:49 +0100
Subject: [PATCH 2/2] [Blog] SGLang router integration and disaggregated inference roadmap

---
 docs/blog/posts/sglang-router.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/blog/posts/sglang-router.md b/docs/blog/posts/sglang-router.md
index e0bf2861ba..b6a2bef836 100644
--- a/docs/blog/posts/sglang-router.md
+++ b/docs/blog/posts/sglang-router.md
@@ -160,7 +160,7 @@ With this integration, K/V cache reuse across replicas becomes possible — a ke

Looking ahead, this integration also shapes our roadmap. Over the coming releases, we plan to expand support in several key areas:

* Enabling prefill and decode worker separation for full disaggregation (today, only standard workers are supported).
-* Introducing auto-scaling based on inter-token latency, rather than relying solely on request-per-second metrics.
+* Introducing auto-scaling based on TTFT (Time to First Token) and ITL (Inter-Token Latency), complementing the current requests-per-second scaling metric.
* Extending native support to more emerging inference stacks.

## What's next?
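As an aside, the routing policies described in the post are easy to picture with a toy sketch. The following is not SGLang's implementation — just an illustration of `power_of_two` and `round_robin` selection, using per-worker in-flight request counts as the load signal:

```python
import random

def power_of_two(loads, rng):
    """Sample two distinct workers and route to the less-loaded one."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

def round_robin(n_workers):
    """Cycle through workers in order, forever."""
    i = 0
    while True:
        yield i % n_workers
        i += 1

rng = random.Random(0)
loads = [10, 0, 5, 7]  # in-flight requests per worker
picks = [power_of_two(loads, rng) for _ in range(100)]
```

Under this policy the most loaded worker is never chosen when paired against a lighter one, while the idle worker wins every pairing it appears in — which is why the policy balances well despite sampling only two candidates per request.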