From 0373268346337700458b4a3f2ee2ecb26f893d1e Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 30 Sep 2025 10:47:12 +0200 Subject: [PATCH 1/5] [Docs] Kubernetes guide --- docs/docs/guides/kubernetes.md | 111 +++++++++++++++++++++++++++++++++ mkdocs.yml | 7 ++- 2 files changed, 115 insertions(+), 3 deletions(-) create mode 100644 docs/docs/guides/kubernetes.md diff --git a/docs/docs/guides/kubernetes.md b/docs/docs/guides/kubernetes.md new file mode 100644 index 0000000000..548d146877 --- /dev/null +++ b/docs/docs/guides/kubernetes.md @@ -0,0 +1,111 @@ +# Kubernetes + +While `dstack` can run natively without Kubernetes on both cloud (via cloud [backends](../concepts/backends.md)) and on-prem +(via [SSH fleets](../concepts/fleets.md#ssh)), it also supports running dev environments, tasks, and services directly on Kubernetes clusters through its native integration — the `kubernetes` backend. + +## Setting up the backend + +To use the `kubernetes` backend with `dstack`, you need to configure it with the path to the kubeconfig file, the IP address of any node in the cluster, and the port that `dstack` will use for proxying SSH traffic. +This configuration is defined in the `~/.dstack/server/config.yml` file: + +
+ +```yaml +projects: +- name: main + backends: + - type: kubernetes + kubeconfig: + filename: ~/.kube/config + proxy_jump: + hostname: 204.12.171.137 + port: 32000 +``` + +
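Before wiring the backend in, you can sanity-check the kubeconfig and note a node IP to use for `proxy_jump` with a plain `kubectl` call (a generic check, not a `dstack`-specific step; the node names and addresses below are illustrative):

```shell
$ kubectl get nodes -o wide

NAME     STATUS   ROLES           VERSION   INTERNAL-IP   EXTERNAL-IP
node-1   Ready    control-plane   v1.30.2   10.0.0.4      204.12.171.137
node-2   Ready    <none>          v1.30.2   10.0.0.5      204.12.171.138
```

Any node address reachable from both the `dstack` server and the CLI will do.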
+ +### Proxy jump + +To allow the `dstack` server and CLI to access runs via SSH, `dstack` requires a node that acts as a jump host to proxy SSH traffic into containers. + +To configure this node, specify `hostname` and `port` under the `proxy_jump` property: + +- `hostname` — the IP address of any cluster node selected as the jump host. Both the `dstack` server and CLI must be able to reach it. This node can be either a GPU node or a CPU-only node — it makes no difference. +- `port` — any accessible port on that node, which `dstack` uses to forward SSH traffic. + +No additional setup is required — `dstack` configures and manages the proxy automatically. + +### NVIDIA GPU Operator + +> For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the +[NVIDIA GPU Operator :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html){:target="_blank"} pre-installed. + +After the backend is set up, you interact with `dstack` just as you would with other backends or SSH fleets. You can run dev environments, tasks, and services. + +## Fleets + +### Clusters + +If you’d like to run [distributed tasks](../concepts/tasks.md#distributed-tasks) with the `kubernetes` backend, you first need to create a fleet with `placement` set to `cluster`: + +

    ```yaml
    type: fleet
    # The name is optional; if not specified, one is generated automatically
    name: my-k8s-fleet

    # For `kubernetes`, `min` should be set to `0` since it can't pre-provision VMs.
    # Optionally, you can set the maximum number of nodes to limit scaling.
    nodes: 0..

    placement: cluster

    backends: [kubernetes]

    resources:
      # Specify requirements to filter nodes
      gpu: 1..8
    ```

+ +Then, create the fleet using the `dstack apply` command: + +
+ +```shell +$ dstack apply -f examples/misc/fleets/.dstack.yml + +Provisioning... +---> 100% + + FLEET INSTANCE BACKEND GPU PRICE STATUS CREATED +``` + +
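For reference, a distributed task that targets such a fleet might look like the sketch below. The name, script, and `torchrun` flags are illustrative, and it assumes the multi-node environment variables that `dstack` provides to runs (`DSTACK_NODES_NUM`, `DSTACK_NODE_RANK`, `DSTACK_MASTER_NODE_IP`):

```yaml
type: task
name: train-distrib

# Must fit within the fleet's node range
nodes: 2

commands:
  - torchrun
    --nnodes $DSTACK_NODES_NUM
    --node-rank $DSTACK_NODE_RANK
    --master-addr $DSTACK_MASTER_NODE_IP
    train.py

resources:
  gpu: 1..8
```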
+ +Once the fleet is created, you can run [distributed tasks](../concepts/tasks.md#distributed-tasks). `dstack` takes care of orchestration automatically. + +For more details on clusters, see the [corresponding guide](clusters.md). + +> Fleets with `placement` set to `cluster` can be used not only for distributed tasks, but also for dev environments, single-node tasks, and services. +> Since Kubernetes clusters are interconnected by default, you can always set `placement` to `cluster`. + +!!! info "Fleets" + It’s generally recommended to create [fleets](../concepts/fleets.md) even if you don’t plan to run distributed tasks. + +## FAQ + +??? info "Is managed Kubernetes with auto-scaling supported?" + Managed Kubernetes is supported. However, the `kubernetes` backend can only run on pre-provisioned nodes. + Support for auto-scalable Kubernetes clusters is coming soon—you can track progress in the corresponding [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/3126){:target="_blank"}. + + If on-demand provisioning is important, we recommend using [cloud backends](../concepts/backends.md) instead of the `kubernetes` backend, as cloud backends already support auto-scaling. + +??? info "When should I use the Kubernetes backend?" + Choose the `kubernetes` backend if your GPUs already run on Kubernetes and your team depends on its ecosystem and tooling. + + If your priority is orchestrating cloud GPUs and Kubernetes isn’t a must, [cloud backends](../concepts/backends.md) are a better fit thanks to their native cloud integration. + + For on-prem GPUs where Kubernetes is optional, [SSH fleets](../concepts/fleets.md#ssh) provide a simpler and more lightweight alternative. 
diff --git a/mkdocs.yml b/mkdocs.yml index c293c86d88..ece24ac776 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -229,12 +229,13 @@ nav: - Projects: docs/concepts/projects.md - Gateways: docs/concepts/gateways.md - Guides: - - Protips: docs/guides/protips.md - - Metrics: docs/guides/metrics.md - Clusters: docs/guides/clusters.md + - Kubernetes: docs/guides/kubernetes.md - Server deployment: docs/guides/server-deployment.md - - Plugins: docs/guides/plugins.md - Troubleshooting: docs/guides/troubleshooting.md + - Metrics: docs/guides/metrics.md + - Protips: docs/guides/protips.md + - Plugins: docs/guides/plugins.md - Reference: - .dstack.yml: - dev-environment: docs/reference/dstack.yml/dev-environment.md From 7469f78744e2abd8013bab0a2e6ff85e70cdd1e3 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 30 Sep 2025 13:58:43 +0200 Subject: [PATCH 2/5] [Docs] Kubernetes guide Rework `Backends` and `Fleets` pages to reflect the changes related to Kubernetes --- docs/docs/concepts/backends.md | 336 +++++++++++++++------------------ docs/docs/concepts/fleets.md | 66 ++++--- docs/docs/guides/kubernetes.md | 8 +- 3 files changed, 197 insertions(+), 213 deletions(-) diff --git a/docs/docs/concepts/backends.md b/docs/docs/concepts/backends.md index 341585a660..56962d1d4d 100644 --- a/docs/docs/concepts/backends.md +++ b/docs/docs/concepts/backends.md @@ -1,15 +1,30 @@ # Backends -Backends allow `dstack` to manage compute across various providers. -They can be configured via `~/.dstack/server/config.yml` (or through the [project settings page](../concepts/projects.md#backends) in the UI). +Backends allow `dstack` to manage compute across various environments. +They can be configured via `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. -See below for examples of backend configurations. +`dstack` supports three types of backends: -??? 
info "SSH fleets" - For using `dstack` with on-prem servers, no backend configuration is required. - Use [SSH fleets](../concepts/fleets.md#ssh) instead once the server is up. + * [VM-based](#vm-based) – use `dstack`'s native integration with cloud providers to provision VMs, manage clusters, and orchestrate container-based runs. + * [Container-based](#container-based) – use either `dstack`'s native integration with cloud providers or Kubernetes to orchestrate container-based runs; provisioning in this case is delegated to the cloud provider or Kubernetes. + * [On-prem](#on-prem) – use `dstack`'s native support for on-prem servers without needing Kubernetes. -## AWS +??? info "dstack Sky" + If you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}, + you can either configure your own backends or use the pre-configured backend that gives you access to compute from the GPU marketplace. + +See the examples of backend configuration below. + +## VM-based + +VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers. +Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand. + +Compared to [container-based](#container-based) backends, this approach offers finer-grained, simpler control over cluster provisioning and eliminates the dependency on a Kubernetes layer. + + + +### AWS There are two ways to configure AWS: using an access key or using the default credentials. @@ -245,7 +260,7 @@ There are two ways to configure AWS: using an access key or using the default cr * (For NVIDIA instances) NVIDIA/CUDA drivers and NVIDIA Container Toolkit are installed * The firewall (`iptables`, `ufw`, etc.) must allow external traffic to port 22 and all traffic within the private subnet, and should forbid any other incoming external traffic. 
-## Azure +### Azure There are two ways to configure Azure: using a client secret or using the default credentials. @@ -396,7 +411,7 @@ There are two ways to configure Azure: using a client secret or using the defaul Using private subnets assumes that both the `dstack` server and users can access the configured VPC's private subnets. Additionally, private subnets must have outbound internet connectivity provided by [NAT Gateway or other mechanism](https://learn.microsoft.com/en-us/azure/nat-gateway/nat-overview). -## GCP +### GCP There are two ways to configure GCP: using a service account or using the default credentials. @@ -580,7 +595,7 @@ gcloud projects list --format="json(projectId)" Using private subnets assumes that both the `dstack` server and users can access the configured VPC's private subnets. Additionally, [Cloud NAT](https://cloud.google.com/nat/docs/overview) must be configured to provide access to external resources for provisioned instances. -## Lambda +### Lambda Log into your [Lambda Cloud :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/service/gpu-cloud) account, click API keys in the sidebar, and then click the `Generate API key` button to create a new API key. @@ -601,7 +616,7 @@ projects: -## Nebius +### Nebius Log into your [Nebius AI Cloud :material-arrow-top-right-thin:{ .external }](https://console.eu.nebius.com/) account, navigate to Access, and select Service Accounts. Create a service account, add it to the editors group, and upload its authorized key. @@ -669,66 +684,7 @@ projects: Nebius is only supported if `dstack server` is running on Python 3.10 or higher. - -## RunPod - -Log into your [RunPod :material-arrow-top-right-thin:{ .external }](https://www.runpod.io/console/) console, click Settings in the sidebar, expand the `API Keys` section, and click -the button to create a Read & Write key. - -Then proceed to configuring the backend. - -
- -```yaml -projects: - - name: main - backends: - - type: runpod - creds: - type: api_key - api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9 -``` - -
- -??? info "Community Cloud" - By default, `dstack` considers instance offers from both the Secure Cloud and the - [Community Cloud :material-arrow-top-right-thin:{ .external }](https://docs.runpod.io/references/faq/#secure-cloud-vs-community-cloud). - - You can tell them apart by their regions. - Secure Cloud regions contain datacenter IDs such as `CA-MTL-3`. - Community Cloud regions contain country codes such as `CA`. - -
- - ```shell - $ dstack apply -f .dstack.yml -b runpod - - # BACKEND REGION INSTANCE SPOT PRICE - 1 runpod CA NVIDIA A100 80GB PCIe yes $0.6 - 2 runpod CA-MTL-3 NVIDIA A100 80GB PCIe yes $0.82 - ``` - -
- - If you don't want to use the Community Cloud, set `community_cloud: false` in the backend settings. - -
- - ```yaml - projects: - - name: main - backends: - - type: runpod - creds: - type: api_key - api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9 - community_cloud: false - ``` - -
- -## Vultr +### Vultr Log into your [Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/) account, click `Account` in the sidebar, select `API`, find the `Personal Access Token` panel and click the `Enable API` button. In the `Access Control` panel, allow API requests from all addresses or from the subnet where your `dstack` server is deployed. @@ -748,53 +704,7 @@ projects: -## Vast.ai - -Log into your [Vast.ai :material-arrow-top-right-thin:{ .external }](https://cloud.vast.ai/) account, click Account in the sidebar, and copy your -API Key. - -Then, go ahead and configure the backend: - -
- -```yaml -projects: -- name: main - backends: - - type: vastai - creds: - type: api_key - api_key: d75789f22f1908e0527c78a283b523dd73051c8c7d05456516fc91e9d4efd8c5 -``` - -
- -Also, the `vastai` backend supports on-demand instances only. Spot instance support coming soon. - - - -## CUDO +### CUDO Log into your [CUDO Compute :material-arrow-top-right-thin:{ .external }](https://compute.cudo.org/) account, click API keys in the sidebar, and click the `Create an API key` button. @@ -815,7 +725,7 @@ projects: -## OCI +### OCI There are two ways to configure OCI: using client credentials or using the default credentials. @@ -889,7 +799,7 @@ There are two ways to configure OCI: using client credentials or using the defau compartment_id: ocid1.compartment.oc1..aaaaaaaa ``` -## DataCrunch +### DataCrunch Log into your [DataCrunch :material-arrow-top-right-thin:{ .external }](https://cloud.datacrunch.io/) account, click Keys in the sidebar, find `REST API Credentials` area and then click the `Generate Credentials` button. @@ -910,7 +820,7 @@ projects: -## AMD Developer Cloud +### AMD Developer Cloud Log into your [AMD Developer Cloud :material-arrow-top-right-thin:{ .external }](https://amd.digitalocean.com/login) account. Click `API` in the sidebar and click the button `Generate New Token`. Then, go ahead and configure the backend: @@ -944,7 +854,7 @@ projects: * `ssh_key` - create, read, update, delete -## Digital Ocean +### Digital Ocean Log into your [Digital Ocean :material-arrow-top-right-thin:{ .external }](https://cloud.digitalocean.com/login) account. Click `API` in the sidebar and click the button `Generate New Token`. Then, go ahead and configure the backend: @@ -977,7 +887,7 @@ projects: * `sizes` - read * `ssh_key` - create, read, update,delete -## Hot Aisle +### Hot Aisle Log in to the SSH TUI as described in the [Hot Aisle Quick Start :material-arrow-top-right-thin:{ .external }](https://hotaisle.xyz/quick-start/). Create a new team and generate an API key for the member in the team. 
@@ -1006,7 +916,7 @@ projects: * **Operator role for the team** - Required for managing virtual machines within the team -## CloudRift +### CloudRift Log into your [CloudRift :material-arrow-top-right-thin:{ .external }](https://console.cloudrift.ai/) console, click `API Keys` in the sidebar and click the button to create a new API key. @@ -1028,60 +938,101 @@ projects: -## Kubernetes +## Container-based -To configure a Kubernetes backend, specify the path to the kubeconfig file, -and the port that `dstack` can use for proxying SSH traffic. -In case of a self-managed cluster, also specify the IP address of any node in the cluster. +Container-based backends allow `dstack` to orchestrate container-based runs either directly on cloud providers that support containers or on Kubernetes. +In this case, `dstack` delegates provisioning to the cloud provider or Kubernetes. -[//]: # (TODO: Mention that the Kind context has to be selected via `current-context` ) +Compared to [VM-based](#vm-based) backends, they offer less fine-grained control over provisioning but rely on the native logic of the underlying environment, whether that’s a cloud provider or Kubernetes. -=== "Self-managed" + - Here's how to configure the backend to use a self-managed cluster. +### Kubernetes -
+Regardless of whether it’s on-prem Kubernetes or managed, `dstack` can orchestrate container-based runs across your clusters. - ```yaml - projects: - - name: main - backends: - - type: kubernetes - kubeconfig: - filename: ~/.kube/config - proxy_jump: - hostname: localhost # The external IP address of any node - port: 32000 # Any port accessible outside of the cluster - ``` +To use the `kubernetes` backend with `dstack`, you need to configure it with the path to the kubeconfig file, the IP address of any node in the cluster, and the port that `dstack` will use for proxying SSH traffic. -
+
- The port specified to `port` must be accessible outside of the cluster. +```yaml +projects: +- name: main + backends: + - type: kubernetes + kubeconfig: + filename: ~/.kube/config + proxy_jump: + hostname: 204.12.171.137 + port: 32000 +``` - ??? info "Kind" - If you are using [Kind](https://kind.sigs.k8s.io/), make sure to make - to set up `port` via `extraPortMappings` for proxying SSH traffic: - - ```yaml - kind: Cluster - apiVersion: kind.x-k8s.io/v1alpha4 - nodes: - - role: control-plane - extraPortMappings: - - containerPort: 32000 # Must be same as `port` - hostPort: 32000 # Must be same as `port` - ``` +
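For `dstack` to detect GPUs, the cluster needs the NVIDIA GPU Operator (see the note below). If the cluster has NVIDIA GPUs but the operator isn't installed yet, it can typically be added with Helm, following NVIDIA's documented flow (the release name and namespace below are conventional, not required):

```shell
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace
```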
+ +??? info "Proxy jump" + To allow the `dstack` server and CLI to access runs via SSH, `dstack` requires a node that acts as a jump host to proxy SSH traffic into containers. + + To configure this node, specify `hostname` and `port` under the `proxy_jump` property: + + - `hostname` — the IP address of any cluster node selected as the jump host. Both the `dstack` server and CLI must be able to reach it. This node can be either a GPU node or a CPU-only node — it makes no difference. + - `port` — any accessible port on that node, which `dstack` uses to forward SSH traffic. + + No additional setup is required — `dstack` configures and manages the proxy automatically. + +??? info "NVIDIA GPU Operator" + For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the + [NVIDIA GPU Operator :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html){:target="_blank"} pre-installed. + + - ```shell - kind create cluster --config examples/misc/kubernetes/kind-config.yml - ``` +> To learn more, see the [Kubernetes](../guides/kubernetes.md) guide. + +### RunPod + +Log into your [RunPod :material-arrow-top-right-thin:{ .external }](https://www.runpod.io/console/) console, click Settings in the sidebar, expand the `API Keys` section, and click +the button to create a Read & Write key. + +Then proceed to configuring the backend. + +
-[//]: # (TODO: Elaborate on the Kind's IP address on Linux) +```yaml +projects: + - name: main + backends: + - type: runpod + creds: + type: api_key + api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9 +``` -=== "Managed" - Here's how to configure the backend to use a managed cluster (AWS, GCP, Azure). +
+ +??? info "Community Cloud" + By default, `dstack` considers instance offers from both the Secure Cloud and the + [Community Cloud :material-arrow-top-right-thin:{ .external }](https://docs.runpod.io/references/faq/#secure-cloud-vs-community-cloud). + + You can tell them apart by their regions. + Secure Cloud regions contain datacenter IDs such as `CA-MTL-3`. + Community Cloud regions contain country codes such as `CA`. + +
+ + ```shell + $ dstack apply -f .dstack.yml -b runpod + + # BACKEND REGION INSTANCE SPOT PRICE + 1 runpod CA NVIDIA A100 80GB PCIe yes $0.6 + 2 runpod CA-MTL-3 NVIDIA A100 80GB PCIe yes $0.82 + ``` + +
+ + If you don't want to use the Community Cloud, set `community_cloud: false` in the backend settings.
@@ -1089,40 +1040,49 @@ In case of a self-managed cluster, also specify the IP address of any node in th projects: - name: main backends: - - type: kubernetes - kubeconfig: - filename: ~/.kube/config - proxy_jump: - port: 32000 # Any port accessible outside of the cluster + - type: runpod + creds: + type: api_key + api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9 + community_cloud: false ```
- The port specified to `port` must be accessible outside of the cluster. +### Vast.ai - ??? info "EKS" - For example, if you are using EKS, make sure to add it via an ingress rule - of the corresponding security group: - - ```shell - aws ec2 authorize-security-group-ingress --group-id --protocol tcp --port 32000 --cidr 0.0.0.0/0 - ``` +Log into your [Vast.ai :material-arrow-top-right-thin:{ .external }](https://cloud.vast.ai/) account, click Account in the sidebar, and copy your +API Key. -[//]: # (TODO: Elaborate on gateways, and what backends allow configuring them) +Then, go ahead and configure the backend: -[//]: # (TODO: Should we automatically detect ~/.kube/config) +
-??? info "NVIDIA GPU Operator" - To use GPUs with Kubernetes, the cluster must be installed with the - [NVIDIA GPU Operator :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html). +```yaml +projects: +- name: main + backends: + - type: vastai + creds: + type: api_key + api_key: d75789f22f1908e0527c78a283b523dd73051c8c7d05456516fc91e9d4efd8c5 +``` + +
+ +Also, the `vastai` backend supports on-demand instances only. Spot instance support coming soon. + +## On-prem + +In on-prem environments, the [Kubernetes](#kubernetes) backend can be used if a Kubernetes cluster is already set up and configured. +However, often [SSH fleets](../concepts/fleets.md#ssh) are a simpler and lighter alternative. - [//]: # (TODO: Provide short yet clear instructions. Elaborate on whether it works with Kind.) +### SSH fleets -## dstack Sky +SSH fleets require no backend configuration. +All you need to do is [provide hostnames and SSH credentials](../concepts/fleets.md#ss), and `dstack` sets up a fleet that can orchestrate container-based runs on your servers. -If you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}, -backends come pre-configured to use compute from the dstack marketplace. However, you can update the configuration via UI -to use your own cloud accounts instead. +> SSH fleets support the same features as [VM-based](#vm-based) backends. !!! info "What's next" 1. See the [`~/.dstack/server/config.yml`](../reference/server/config.yml.md) reference diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md index 406763d189..cd49ff707d 100644 --- a/docs/docs/concepts/fleets.md +++ b/docs/docs/concepts/fleets.md @@ -1,19 +1,19 @@ # Fleets -Fleets are groups of instances used to run dev environments, tasks, and services. -Depending on the fleet configuration, instances can be interconnected clusters or standalone instances. +Fleets act both as pools of instances and as templates for how those instances are provisioned. 
`dstack` supports two kinds of fleets: -* [Cloud fleets](#cloud) – dynamically provisioned through configured backends -* [SSH fleets](#ssh) – created using on-prem servers +* [Standard fleets](#standard) – dynamically provisioned through configured backends; they are supported with any type of backends: [VM-based](backends.md#vm-based), [container-based](backends.md#container-based), and [Kubernetes](backends.md#kubernetes) +* [SSH fleets](#ssh) – created using on-prem servers; do not require backends -## Cloud fleets { #cloud } +## Standard fleets { #standard } -When you call `dstack apply` to run a dev environment, task, or service, `dstack` reuses `idle` instances -from an existing fleet. If none match the requirements, `dstack` creates a new cloud fleet. +When you run `dstack apply` to start a dev environment, task, or service, `dstack` will reuse idle instances +from an existing fleet whenever available. -For greater control over cloud fleet provisioning, create fleets explicitly using configuration files. +If no fleet meets the requirements or has idle capacity, `dstack` can create a new fleet on the fly. +However, it’s generally better to define fleets explicitly in configuration files for greater control. ### Apply a configuration @@ -27,7 +27,7 @@ Define a fleet configuration as a YAML file in your project directory. The file # The name is optional, if not specified, generated randomly name: my-fleet - # Specify the number of instances + # Can be a range or a fixed number nodes: 2 # Uncomment to ensure instances are inter-connected #placement: cluster @@ -57,6 +57,30 @@ Provisioning... Once the status of instances changes to `idle`, they can be used by dev environments, tasks, and services. +??? info "Container-based backends" + [Container-based](backends.md#container-based) backends don’t support pre-provisioning, + so `nodes` can only be set to a range starting with `0`. 
+ + This means instances are created only when a run starts, and once it finishes, they’re terminated and released back to the provider (either a cloud service or Kubernetes). + +

    ```yaml
    type: fleet
    # The name is optional, if not specified, generated randomly
    name: my-fleet

    # For container-based backends, the range must start with `0`
    nodes: 0..2
    # Uncomment to ensure instances are inter-connected
    #placement: cluster

    resources:
      gpu: 24GB
    ```

+ ### Configuration options #### Nodes { #nodes } @@ -71,30 +95,30 @@ type: fleet name: my-fleet nodes: - min: 1 # Always maintain at least 1 instance - target: 2 # Provision 2 instances initially - max: 3 # Do not allow more than 3 instances + min: 1 # Always maintain at least 1 idle instance. Can be 0. + target: 2 # (Optional) Provision 2 instances initially + max: 3 # (Optional) Do not allow more than 3 instances ``` `dstack` ensures the fleet always has at least `nodes.min` instances, creating new instances in the background if necessary. If you don't need to keep instances in the fleet forever, you can set `nodes.min` to `0`. By default, `dstack apply` also provisions `nodes.min` instances. The `nodes.target` property allows provisioning more instances initially than needs to be maintained. -#### Placement { #cloud-placement } +#### Placement { #standard-placement } To ensure instances are interconnected (e.g., for [distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`. This ensures all instances are provisioned with optimal inter-node connectivity. ??? info "AWS" - When you create a cloud fleet with AWS, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type. + When you create a fleet with AWS, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type. Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration. Otherwise, instances are only connected by the default VPC subnet. Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details. ??? 
info "GCP" - When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured. + When you create a fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured. !!! info "Backend configuration" Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration. @@ -102,7 +126,7 @@ This ensures all instances are provisioned with optimal inter-node connectivity. [A3 High](../../examples/clusters/a3high/index.md) examples for more details. ??? info "Nebius" - When you create a cloud fleet with Nebius, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type. + When you create a fleet with Nebius, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type. Otherwise, instances are only connected by the default VPC subnet. An InfiniBand fabric for the cluster is selected automatically. If you prefer to use some specific fabrics, configure them in the @@ -113,6 +137,8 @@ backends. > For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide. + + #### Resources When you specify a resource value like `cpu` or `memory`, @@ -163,9 +189,9 @@ and their quantity. 
Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10 > If you’re unsure which offers (hardware configurations) are available from the configured backends, use the > [`dstack offer`](../reference/cli/dstack/offer.md#list-gpu-offers) command to list them. -#### Blocks { #cloud-blocks } +#### Blocks { #standard-blocks } -For cloud fleets, `blocks` function the same way as in SSH fleets. +For standard fleets, `blocks` function the same way as in SSH fleets. See the [`Blocks`](#ssh-blocks) section under SSH fleets for details on the blocks concept.
@@ -244,10 +270,8 @@ retry:
-> Cloud fleets are supported by all backends except `kubernetes`, `vastai`, and `runpod`. - !!! info "Reference" - Cloud fleets support many more configuration options, + Standard fleets support many more configuration options, incl. [`backends`](../reference/dstack.yml/fleet.md#backends), [`regions`](../reference/dstack.yml/fleet.md#regions), [`max_price`](../reference/dstack.yml/fleet.md#max_price), and diff --git a/docs/docs/guides/kubernetes.md b/docs/docs/guides/kubernetes.md index 548d146877..b55d01022e 100644 --- a/docs/docs/guides/kubernetes.md +++ b/docs/docs/guides/kubernetes.md @@ -1,7 +1,7 @@ # Kubernetes -While `dstack` can run natively without Kubernetes on both cloud (via cloud [backends](../concepts/backends.md)) and on-prem -(via [SSH fleets](../concepts/fleets.md#ssh)), it also supports running dev environments, tasks, and services directly on Kubernetes clusters through its native integration — the `kubernetes` backend. +While `dstack` can orchestrate container-based runs natively on both cloud and on-prem without Kubernetes (see [backends](../concepts/backends.md)), it also supports running container-based workloads directly on Kubernetes clusters. +For that, you need to configure the `kubernetes` backend. ## Setting up the backend @@ -101,11 +101,11 @@ For more details on clusters, see the [corresponding guide](clusters.md). Managed Kubernetes is supported. However, the `kubernetes` backend can only run on pre-provisioned nodes. Support for auto-scalable Kubernetes clusters is coming soon—you can track progress in the corresponding [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/3126){:target="_blank"}. - If on-demand provisioning is important, we recommend using [cloud backends](../concepts/backends.md) instead of the `kubernetes` backend, as cloud backends already support auto-scaling. 
+ If on-demand provisioning is important, we recommend using [VM-based](../concepts/backends.md#vm-based) backends as they already support auto-scaling. ??? info "When should I use the Kubernetes backend?" Choose the `kubernetes` backend if your GPUs already run on Kubernetes and your team depends on its ecosystem and tooling. - If your priority is orchestrating cloud GPUs and Kubernetes isn’t a must, [cloud backends](../concepts/backends.md) are a better fit thanks to their native cloud integration. + If your priority is orchestrating cloud GPUs and Kubernetes isn’t a must, [VM-based](../concepts/backends.md#vm-based) backends are a better fit thanks to their native cloud integration. For on-prem GPUs where Kubernetes is optional, [SSH fleets](../concepts/fleets.md#ssh) provide a simpler and more lightweight alternative. From 2b4c012f43fef010cfbfec68141f55861ceb26aa Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Tue, 30 Sep 2025 18:53:56 +0200 Subject: [PATCH 3/5] [Docs] Improve Kubernetes documentation Updated `README`, `Overview`, `Installation` --- README.md | 18 ++++++++++-------- docs/docs/concepts/backends.md | 4 ++-- docs/docs/index.md | 7 ++++--- docs/docs/installation/index.md | 17 +++++++++-------- docs/overrides/home.html | 19 ++++++++++--------- 5 files changed, 35 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index 1f8ec4296f..540ea738d7 100644 --- a/README.md +++ b/README.md @@ -14,9 +14,11 @@ -`dstack` provides a unified control plane for running development, training, and inference on GPUs — across cloud VMs, Kubernetes, or on-prem clusters. It helps your team avoid vendor lock-in and reduce GPU costs. +`dstack` is a unified control plane for GPU provisioning and orchestration that works with any GPU cloud, Kubernetes, or on-prem clusters. -#### Accelerators +It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks. 
+ +#### Hardware `dstack` supports `NVIDIA`, `AMD`, `Google TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box. @@ -44,15 +46,15 @@ #### Set up the server -##### (Optional) Configure backends +##### Configure backends + +To orchestrate compute across cloud providers or existing Kubernetes clusters, you need to configure backends. -To use `dstack` with cloud providers, configure backends -via the `~/.dstack/server/config.yml` file. +Backends can be set up in `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. -For more details on how to configure backends, check [Backends](https://dstack.ai/docs/concepts/backends). +For more details, see [Backends](../concepts/backends.md). -> For using `dstack` with on-prem servers, create [SSH fleets](https://dstack.ai/docs/concepts/fleets#ssh) -> once the server is up. +> When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh) once the server is up. ##### Start the server diff --git a/docs/docs/concepts/backends.md b/docs/docs/concepts/backends.md index 56962d1d4d..78b5509eef 100644 --- a/docs/docs/concepts/backends.md +++ b/docs/docs/concepts/backends.md @@ -1080,9 +1080,9 @@ However, often [SSH fleets](../concepts/fleets.md#ssh) are a simpler and lighter ### SSH fleets SSH fleets require no backend configuration. -All you need to do is [provide hostnames and SSH credentials](../concepts/fleets.md#ss), and `dstack` sets up a fleet that can orchestrate container-based runs on your servers. +All you need to do is [provide hostnames and SSH credentials](../concepts/fleets.md#ssh), and `dstack` sets up a fleet that can orchestrate container-based runs on your servers. -> SSH fleets support the same features as [VM-based](#vm-based) backends. +SSH fleets support the same features as [VM-based](#vm-based) backends. !!! info "What's next" 1. 
See the [`~/.dstack/server/config.yml`](../reference/server/config.yml.md) reference diff --git a/docs/docs/index.md b/docs/docs/index.md index c52ca6b94c..aaea12b1c3 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -1,9 +1,10 @@ # What is dstack? -`dstack` is an open-source container orchestrator that simplifies workload orchestration -and drives GPU utilization for ML teams. It works with any GPU cloud, on-prem cluster, or accelerated hardware. +`dstack` is a unified control plane for GPU provisioning and orchestration that works with any GPU cloud, Kubernetes, or on-prem clusters. -#### Accelerators +It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks. + +#### Hardware `dstack` supports `NVIDIA`, `AMD`, `TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box. diff --git a/docs/docs/installation/index.md b/docs/docs/installation/index.md index 85819ba9d3..b91a96b95f 100644 --- a/docs/docs/installation/index.md +++ b/docs/docs/installation/index.md @@ -1,20 +1,21 @@ # Installation -> If you don't want to host the `dstack` server (or want to access GPU marketplace), -> skip installation and proceed to [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}. +!!! info "dstack Sky" + If you don't want to host the `dstack` server (or want to access GPU marketplace), + skip installation and proceed to [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}. ## Set up the server -### (Optional) Configure backends +### Configure backends -Backends allow `dstack` to manage compute across various providers. -They can be configured via `~/.dstack/server/config.yml` (or through the [project settings page](../concepts/projects.md#backends) in the UI). +To orchestrate compute across cloud providers or existing Kubernetes clusters, you need to configure backends. 
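As a minimal sketch of such a backend configuration (the `aws` backend and default credentials here are illustrative, not part of this patch — substitute the backend type and credentials your provider requires):

```yaml
# Hypothetical minimal ~/.dstack/server/config.yml with one cloud backend.
projects:
- name: main
  backends:
  - type: aws
    creds:
      type: default
```

Multiple backends can be listed under the same project, and the `kubernetes` backend configuration shown earlier follows the same shape.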
-For more details on how to configure backends, check [Backends](../concepts/backends.md). +Backends can be set up in `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. + +For more details, see [Backends](../concepts/backends.md). ??? info "SSH fleets" - For using `dstack` with on-prem servers, create [SSH fleets](../concepts/fleets.md#ssh) - once the server is up. + When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh) once the server is up. ### Start the server diff --git a/docs/overrides/home.html b/docs/overrides/home.html index 8304744c5f..1dd9df6f4f 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html @@ -53,9 +53,9 @@

The orchestration layer for modern ML teams

- dstack provides a unified control plane for running development, training, and inference - on GPUs — across cloud VMs, Kubernetes, or on-prem clusters. It helps your team avoid vendor lock-in and reduce GPU - costs. + dstack provides ML teams with a unified control plane for GPU provisioning and orchestration + across cloud, Kubernetes, and on-prem. It streamlines development, training, and inference — reducing costs 3–7x and + preventing lock-in.

@@ -83,14 +83,15 @@

The orchestration layer for modern ML teams

-

One control plane for all your GPUs

+

An open platform for GPU orchestration

- Instead of wrestling with complex Helm charts and Kubernetes operators, dstack provides a simple, declarative way to - manage clusters, containerized dev environments, training, and inference. + Managing AI infrastructure requires efficient GPU orchestration, whether workloads run + on a single GPU cloud, across multiple GPU providers, or on-prem clusters.

-

This container-native interface makes your team more productive and your GPU usage more efficient—leading to lower - costs and faster iteration. +

+ dstack provides an open stack for GPU orchestration that streamlines development, training, + and inference, and can be used with any hardware, open-source tools, and frameworks.

@@ -219,7 +220,7 @@

Easy to use with on-prem clusters

- + SSH fleets Date: Wed, 1 Oct 2025 20:06:15 +0200 Subject: [PATCH 4/5] [Docs] Improve Kubernetes documentation Minor updates, incl. the description of `Default image`, and `privileged` for NCCL tests --- docs/docs/concepts/dev-environments.md | 8 ++++---- docs/docs/concepts/services.md | 4 ++-- docs/docs/concepts/tasks.md | 6 ++---- examples/clusters/nccl-tests/.dstack.yml | 3 +++ examples/clusters/nccl-tests/README.md | 15 ++++++++------- 5 files changed, 19 insertions(+), 17 deletions(-) diff --git a/docs/docs/concepts/dev-environments.md b/docs/docs/concepts/dev-environments.md index 8a8512e594..3ea8da78c5 100644 --- a/docs/docs/concepts/dev-environments.md +++ b/docs/docs/concepts/dev-environments.md @@ -133,7 +133,7 @@ The `gpu` property lets you specify vendor, model, memory, and count — e.g., ` If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. -??? info "Google Cloud TPU" + ??? info "Shared memory" If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure @@ -159,8 +159,8 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. #### Default image -If you don't specify `image`, `dstack` uses its base Docker image pre-configured with -`uv`, `python`, `pip`, essential CUDA drivers, and NCCL tests (under `/opt/nccl-tests/build`). +If you don't specify `image`, `dstack` uses its [base :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/tree/master/docker/base){:target="_blank"} Docker image pre-configured with + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). Set the `python` property to pre-install a specific version of Python. 
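To make the `python` property above concrete, a minimal dev environment configuration that relies on the default image might look like this (the name, IDE, and resources are illustrative, not taken from this patch):

```yaml
# Hypothetical dev environment: no `image` is specified, so the default
# base image is used; `python` pins the pre-installed Python version.
type: dev-environment
name: cuda-dev

python: "3.12"
ide: vscode

resources:
  gpu: nvidia:1
```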
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md index e8bef93ec9..cb2649e00b 100644 --- a/docs/docs/concepts/services.md +++ b/docs/docs/concepts/services.md @@ -433,8 +433,8 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. #### Default image -If you don't specify `image`, `dstack` uses its base Docker image pre-configured with -`uv`, `python`, `pip`, essential CUDA drivers, and NCCL tests (under `/opt/nccl-tests/build`). +If you don't specify `image`, `dstack` uses its [base :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/tree/master/docker/base){:target="_blank"} Docker image pre-configured with + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). Set the `python` property to pre-install a specific version of Python. diff --git a/docs/docs/concepts/tasks.md b/docs/docs/concepts/tasks.md index add535ecdb..41c1749f85 100644 --- a/docs/docs/concepts/tasks.md +++ b/docs/docs/concepts/tasks.md @@ -229,8 +229,6 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. - ```yaml type: task name: train @@ -259,8 +257,8 @@ If vendor is omitted, `dstack` infers it from the model or defaults to `nvidia`. #### Default image -If you don't specify `image`, `dstack` uses its base Docker image pre-configured with -`uv`, `python`, `pip`, essential CUDA drivers, and NCCL tests (under `/opt/nccl-tests/build`). +If you don't specify `image`, `dstack` uses its [base :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/tree/master/docker/base){:target="_blank"} Docker image pre-configured with + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). Set the `python` property to pre-install a specific version of Python. 
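Likewise, for tasks, a sketch that relies on the default image could look like the following (the file names and commands are placeholders, not part of this patch):

```yaml
# Hypothetical task: uses the default base image with `python` pinned;
# `commands` run inside the container once the run starts.
type: task
name: train

python: "3.12"

commands:
  - uv pip install -r requirements.txt  # placeholder dependency install
  - python train.py                     # placeholder training script

resources:
  gpu: nvidia:1
```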
diff --git a/examples/clusters/nccl-tests/.dstack.yml b/examples/clusters/nccl-tests/.dstack.yml index 164148b3c7..4232e60a9e 100644 --- a/examples/clusters/nccl-tests/.dstack.yml +++ b/examples/clusters/nccl-tests/.dstack.yml @@ -21,6 +21,9 @@ commands: sleep infinity fi +# Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access +#privileged: true + resources: gpu: nvidia:1..8 shm_size: 16GB diff --git a/examples/clusters/nccl-tests/README.md b/examples/clusters/nccl-tests/README.md index 7e5a88d64f..b08ea50ea4 100644 --- a/examples/clusters/nccl-tests/README.md +++ b/examples/clusters/nccl-tests/README.md @@ -33,6 +33,9 @@ commands: sleep infinity fi +# Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access +#privileged: true + resources: gpu: nvidia:1..8 shm_size: 16GB @@ -40,14 +43,12 @@ resources:
- +!!! info "Default image" + If you don't specify `image`, `dstack` uses its [base :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/tree/master/docker/base){:target="_blank"} Docker image pre-configured with + `uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`). -!!! info "Docker image" - The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for - [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also - be used with regular TCP/IP network adapters and InfiniBand. - - See the [source code :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/docker/efa) for the image. +!!! info "Privileged" + In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand). ### Apply a configuration From 25f431733872c6efd3346d9b182304a0b21ff7a1 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Thu, 2 Oct 2025 12:24:44 +0200 Subject: [PATCH 5/5] [Docs] Improve Kubernetes documentation Updated `FAQ` --- docs/overrides/home.html | 124 ++++++++++++++++++++++++++------------- 1 file changed, 82 insertions(+), 42 deletions(-) diff --git a/docs/overrides/home.html b/docs/overrides/home.html index 1dd9df6f4f..ebf302f098 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html @@ -438,81 +438,121 @@

Co-Founder @CUDO Compute

FAQ

- +
+
+
+ How does dstack differ from Slurm? +
+
+
+ +
+

+ dstack fully replaces Slurm. Its + tasks cover job submission, queuing, retries, GPU + health checks, and scheduling for single-node and distributed runs. +

+ +

+ Beyond job scheduling, dstack adds + dev environments for interactive work, + services for production endpoints, and + fleets that give fine-grained control over + cluster provisioning and placement. +

+ +

+ You get one platform for development, training, and deployment across cloud, Kubernetes, and + on-prem. +

+
+
+
How does dstack compare to Kubernetes?
- +
-

Kubernetes is a widely used container orchestrator designed for general-purpose deployments. - To efficiently support GPU workloads, Kubernetes typically requires custom operators, and it - may not offer the most intuitive interface for ML engineers.

- -

dstack takes a different approach, focusing on container - orchestration specifically for AI - workloads, with the goal of making life easier for ML engineers.

- -

Designed to be lightweight, dstack provides a simpler, more - intuitive interface for - development, - training, and inference. It also enables more flexible and cost-effective provisioning - and management of clusters.

- -

For optimal flexibility, dstack and Kubernetes can complement - each other: dstack can handle - development, while Kubernetes manages production deployments.

+

+ Kubernetes is a general-purpose container orchestrator. dstack also + orchestrates containers, but it provides a lightweight and streamlined interface that is purpose + built for ML. +

+ +

+ You declare + dev environments, + tasks, + services, and + fleets + with simple configuration. dstack provisions GPUs, manages clusters via fleets with fine-grained + controls, and optimizes cost and utilization, while keeping a simple UI and CLI. +

+ +

+ If you already use Kubernetes, you can run dstack on it via the Kubernetes backend. +

- +
- How does dstack differ from Slurm? + Can I use dstack with Kubernetes?
- +

- Slurm excels at job scheduling across pre-configured clusters. + Yes. You can connect existing Kubernetes clusters using the Kubernetes backend and run + dev environments, + tasks, and + services on it. + Choose the Kubernetes backend if your GPUs already run on Kubernetes and your team depends on its + ecosystem and tooling. + See the + Kubernetes guide for setup and best practices.

- -

dstack goes beyond scheduling, providing a full suite of - features tailored to ML teams, - including cluster management, dynamic compute provisioning, development environments, and - advanced monitoring. This makes dstack a more comprehensive - solution for AI workloads, - whether in the cloud or on-prem. +

+ If your priority is orchestrating cloud GPUs and Kubernetes isn’t a must, VM-based backends are a better fit + thanks to their native cloud integration. + For on-prem GPUs where Kubernetes is optional, SSH fleets provide a simpler and more lightweight alternative.

- +
When should I use dstack?
- +

- dstack is designed for ML teams aiming to speed up development while reducing GPU costs - across top cloud providers or on-prem clusters. + dstack accelerates ML development with a simple, ML‑native interface. + Spin up dev environments, run + single‑node or distributed tasks, and deploy services without infrastructure overhead.

- +

- Seamlessly integrated with Git, dstack works with any open-source or proprietary frameworks, - making it developer-friendly and vendor-agnostic for training and deploying AI models. + It radically reduces GPU costs via smart orchestration and fine‑grained fleet controls, including efficient reuse, + right‑sizing, and support for spot, on‑demand, and reserved capacity.

- +

- For ML teams seeking a more streamlined, AI-native development platform, dstack - provides an alternative to Kubernetes and Slurm, removing the need for - MLOps or custom solutions. + It is 100% interoperable with your stack and works with any open‑source frameworks and tools, as + well as your own Docker images and code, across cloud, Kubernetes, and on‑prem GPUs.