Commit 4fd74ba

Merge branch 'master' into add_pd_disaggregated_inference
2 parents 6e7dbe7 + 775aff0 commit 4fd74ba

28 files changed

+1290
-565
lines changed

docs/docs/concepts/backends.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -853,7 +853,7 @@ Then, go ahead and configure the backend:
853853
projects:
854854
- name: main
855855
backends:
856-
- type: datacrunch
856+
- type: verda
857857
creds:
858858
type: api_key
859859
client_id: xfaHBqYEsArqhKWX-e52x3HH7w8T
@@ -1049,13 +1049,13 @@ projects:
10491049
verbs: ["get", "create"]
10501050
- apiGroups: [""]
10511051
resources: ["pods"]
1052-
verbs: ["get", "create", "delete"]
1052+
verbs: ["get", "create", "delete", "list"]
10531053
- apiGroups: [""]
10541054
resources: ["services"]
10551055
verbs: ["get", "create", "delete"]
10561056
- apiGroups: [""]
10571057
resources: ["nodes"]
1058-
verbs: ["list"]
1058+
verbs: ["list", "get"]
10591059
```
10601060

10611061
Ensure you've created a ClusterRoleBinding to grant the role to the user or the service account you're using.
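
The line above mentions a ClusterRoleBinding without showing one. A minimal sketch could look like the following, assuming the ClusterRole is named `dstack-backend` and is granted to a service account `dstack` in the `default` namespace. All of these names are hypothetical and do not come from this diff.

```yaml
# Hypothetical names throughout: replace the binding, service account,
# namespace, and role names with the ones used in your cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dstack-backend-binding
subjects:
  - kind: ServiceAccount
    name: dstack
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dstack-backend
```

With those assumed names, the verbs added in this change can be checked with `kubectl auth can-i list pods --as=system:serviceaccount:default:dstack` and `kubectl auth can-i get nodes --as=system:serviceaccount:default:dstack`.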

docs/docs/guides/protips.md

Lines changed: 4 additions & 4 deletions
@@ -439,10 +439,10 @@ Getting offers...
439439
---> 100%
440440
441441
# BACKEND REGION INSTANCE TYPE RESOURCES SPOT PRICE
442-
1 datacrunch FIN-01 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
443-
2 datacrunch FIN-02 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
444-
3 datacrunch FIN-02 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
445-
4 datacrunch ICE-01 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
442+
1 verda FIN-01 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
443+
2 verda FIN-02 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
444+
3 verda FIN-02 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
445+
4 verda ICE-01 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
446446
5 runpod US-KS-2 NVIDIA H100 PCIe 16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.39
447447
6 runpod CA NVIDIA H100 80GB HBM3 24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.69
448448
7 nebius eu-north1 gpu-h100-sxm 16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk) no $2.95

docs/docs/reference/cli/dstack/offer.md

Lines changed: 4 additions & 4 deletions
@@ -58,10 +58,10 @@ Getting offers...
5858
---> 100%
5959

6060
# BACKEND REGION INSTANCE TYPE RESOURCES SPOT PRICE
61-
1 datacrunch FIN-01 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
62-
2 datacrunch FIN-02 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
63-
3 datacrunch FIN-02 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
64-
4 datacrunch ICE-01 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
61+
1 verda FIN-01 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
62+
2 verda FIN-02 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
63+
3 verda FIN-02 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
64+
4 verda ICE-01 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
6565
5 runpod US-KS-2 NVIDIA H100 PCIe 16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.39
6666
6 runpod CA NVIDIA H100 80GB HBM3 24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.69
6767
7 nebius eu-north1 gpu-h100-sxm 16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk) no $2.95

docs/docs/reference/dstack.yml/service.md

Lines changed: 11 additions & 1 deletion
@@ -63,7 +63,7 @@ The `service` configuration type allows running [services](../../concepts/servic
6363
1. Doesn't work if your `chat_template` uses `bos_token`. As a workaround, replace `bos_token` inside `chat_template` with the token content itself.
6464
2. Doesn't work if `eos_token` is defined in the model repository as a dictionary. As a workaround, set `eos_token` manually, as shown in the example above (see Chat template).
6565

66-
If you encounter any other issues, please make sure to file a
66+
If you encounter any ofther issues, please make sure to file a
6767
[GitHub issue](https://github.com/dstackai/dstack/issues/new/choose).
6868

6969
### `scaling`
@@ -127,6 +127,16 @@ The `service` configuration type allows running [services](../../concepts/servic
127127
required: true
128128

129129

130+
### `replicas`
131+
132+
#### `replicas[n]`
133+
134+
#SCHEMA# dstack._internal.core.models.configurations.ReplicaGroup
135+
overrides:
136+
show_root_heading: false
137+
type:
138+
required: true
139+
130140
### `retry`
131141

132142
#SCHEMA# dstack._internal.core.models.profiles.ProfileRetry

docs/docs/reference/server/config.yml.md

Lines changed: 0 additions & 2 deletions
@@ -14,8 +14,6 @@ to configure [backends](../../concepts/backends.md) and other [server-level sett
1414
#SCHEMA# dstack._internal.server.services.config.ProjectConfig
1515
overrides:
1616
show_root_heading: false
17-
backends:
18-
type: 'Union[AWSBackendConfigWithCreds, AzureBackendConfigWithCreds, GCPBackendConfigWithCreds, HotAisleBackendConfigWithCreds, LambdaBackendConfigWithCreds, NebiusBackendConfigWithCreds, RunpodBackendConfigWithCreds, VastAIBackendConfigWithCreds, KubernetesConfig]'
1917

2018
#### `projects[n].backends` { #backends data-toc-label="backends" }
2119

docs/examples.md

Lines changed: 10 additions & 0 deletions
@@ -122,6 +122,16 @@ hide:
122122
Set up Crusoe clusters with optimized networking
123123
</p>
124124
</a>
125+
<a href="/examples/clusters/nebius"
126+
class="feature-cell sky">
127+
<h3>
128+
Nebius
129+
</h3>
130+
131+
<p>
132+
Set up Nebius clusters with optimized networking
133+
</p>
134+
</a>
125135
<a href="/examples/clusters/nccl-rccl-tests"
126136
class="feature-cell sky">
127137
<h3>

docs/examples/clusters/nebius/index.md

Whitespace-only changes.

examples/clusters/crusoe/README.md

Lines changed: 25 additions & 23 deletions
@@ -1,24 +1,25 @@
11
---
22
title: Crusoe
3-
description: Setting up Crusoe clusters using Managed Kubernetes or VMs with InfiniBand support
3+
description: Using Crusoe clusters with InfiniBand support via Kubernetes or VMs
44
---
55

66
# Crusoe
77

8-
Crusoe offers two ways to use clusters with fast interconnect:
8+
`dstack` allows using Crusoe clusters with fast interconnect via two ways:
99

10-
* [Crusoe Managed Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA and AMD GPU operators and related tools.
11-
* [Virtual Machines (VMs)](#vms) – Gives you direct access to clusters in the form of virtual machines with NVIDIA and AMD GPUs.
10+
* [Kubernetes](#kubernetes) – If you create a Kubernetes cluster on Crusoe and configure a `kubernetes` backend and create a backend fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
11+
* [VMs](#vms) – If you create a VM cluster on Crusoe and create an SSH fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
12+
13+
## Kubernetes
1214

13-
Both options use the same underlying networking infrastructure. This example walks you through how to set up Crusoe clusters to use with `dstack`.
15+
### Create a cluster
1416

15-
## Crusoe Managed Kubernetes { #kubernetes }
17+
1. Go `Networking``Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
18+
2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
19+
3. Go the the cluster, and click `Create Node Pool`. Select the right type of the instance, and `Desired Number of Nodes`.
20+
4. Wait until nodes are provisioned.
1621

17-
!!! info "Prerequsisites"
18-
1. Go `Networking``Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
19-
2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
20-
3. Go the the cluster, and click `Create Node Pool`. Select the right type of the instance. If you intend to auto-scale the cluster, make sure to set `Desired Number of Nodes` at least to `1`, since `dstack` doesn't currently support clusters that scale down to `0` nodes.
21-
4. Wait until at least one node is running.
22+
> Even if you enable `autoscaling`, `dstack` can use only the nodes that are already provisioned.
2223
2324
### Configure the backend
2425

@@ -56,7 +57,7 @@ backends: [kubernetes]
5657
5758
resources:
5859
# Specify requirements to filter nodes
59-
gpu: 1..8
60+
gpu: 8
6061
```
6162

6263
</div>
@@ -75,12 +76,13 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs
7576

7677
## VMs
7778

78-
Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
79+
Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets#ssh-fleets).
7980

80-
!!! info "Prerequsisites"
81-
1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select the right instance type and VM image (that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html)). Make sure to create as many instances as needed.
81+
### Create instances
8282

83-
### Create a fleet
83+
1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select the right instance type and VM image (that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html)). Make sure to create as many instances as needed.
84+
85+
### Create a `dstack` fleet
8486

8587
Follow the standard instructions for setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets):
8688

@@ -115,9 +117,9 @@ $ dstack apply -f crusoe-fleet.dstack.yml
115117

116118
Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
117119

118-
## Run NCCL tests
120+
## NCCL tests
119121

120-
Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-task) that runs NCCL tests to validate cluster network bandwidth.
122+
Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) that runs NCCL tests to validate cluster network bandwidth.
121123

122124
=== "Crusoe Managed Kubernetes"
123125

@@ -253,9 +255,9 @@ Provisioning...
253255
254256
nccl-tests provisioning completed (running)
255257
256-
# out-of-place in-place
257-
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
258-
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
258+
out-of-place in-place
259+
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
260+
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
259261
8 2 float sum -1 27.70 0.00 0.00 0 29.82 0.00 0.00 0
260262
16 4 float sum -1 28.78 0.00 0.00 0 28.99 0.00 0.00 0
261263
32 8 float sum -1 28.49 0.00 0.00 0 28.16 0.00 0.00 0
@@ -285,8 +287,8 @@ nccl-tests provisioning completed (running)
285287
536870912 134217728 float sum -1 5300.49 101.29 189.91 0 5314.91 101.01 189.40 0
286288
1073741824 268435456 float sum -1 10472.2 102.53 192.25 0 10485.6 102.40 192.00 0
287289
2147483648 536870912 float sum -1 20749.1 103.50 194.06 0 20745.7 103.51 194.09 0
288-
# Out of bounds values : 0 OK
289-
# Avg bus bandwidth : 53.7387
290+
Out of bounds values : 0 OK
291+
Avg bus bandwidth : 53.7387
290292
```
291293

292294
</div>

examples/clusters/lambda/README.md

Lines changed: 17 additions & 17 deletions
@@ -5,18 +5,17 @@ description: Setting up Lambda clusters using Kubernetes or 1-Click Clusters wit
55

66
# Lambda
77

8-
[Lambda](https://lambda.ai/) offers two ways to use clusters with a fast interconnect:
8+
`dstack` allows using Lambda clusters with fast interconnect via two ways:
99

10-
* [Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA GPU operators and related tools.
11-
* [1-Click Clusters (1CC)](#1-click-clusters) – Gives you direct access to clusters in the form of bare-metal nodes.
12-
13-
Both options use the same underlying networking infrastructure. This example walks you through how to set up Lambda clusters to use with `dstack`.
10+
* [Kubernetes](#kubernetes) – If you create a Kubernetes cluster on Lambda and configure a `kubernetes` backend and create a backend fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
11+
* [VMs](#vms) – If you create a 1CC cluster on Lambda and create an SSH fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
1412

1513
## Kubernetes
1614

17-
!!! info "Prerequsisites"
18-
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
19-
2. Go to `Firewall``Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
15+
### Prerequsisites
16+
17+
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
18+
2. Go to `Firewall``Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
2019

2120
### Configure the backend
2221

@@ -75,8 +74,9 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs
7574

7675
Another way to work with Lambda clusters is through [1CC](https://lambda.ai/1-click-clusters). While `dstack` supports automated cluster provisioning via [VM-based backends](https://dstack.ai/docs/concepts/backends#vm-based), there is currently no programmatic way to provision Lambda 1CCs. As a result, to use a 1CC cluster with `dstack`, you must use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
7776

78-
!!! info "Prerequsisites"
79-
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters
77+
### Prerequsisites
78+
79+
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters
8080

8181
### Create a fleet
8282

@@ -171,11 +171,11 @@ $ dstack apply -f lambda-nccl-tests.dstack.yml
171171
Provisioning...
172172
---> 100%
173173
174-
# nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
175-
# Collective test starting: all_reduce_perf
176-
#
177-
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
178-
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
174+
nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
175+
Collective test starting: all_reduce_perf
176+
177+
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
178+
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
179179
8 2 float sum -1 36.50 0.00 0.00 0 36.16 0.00 0.00 0
180180
16 4 float sum -1 35.55 0.00 0.00 0 35.49 0.00 0.00 0
181181
32 8 float sum -1 35.49 0.00 0.00 0 36.28 0.00 0.00 0
@@ -205,8 +205,8 @@ Provisioning...
205205
536870912 134217728 float sum -1 1625.63 330.25 619.23 0 1687.31 318.18 596.59 0
206206
1073741824 268435456 float sum -1 2972.25 361.26 677.35 0 2971.33 361.37 677.56 0
207207
2147483648 536870912 float sum -1 5784.75 371.23 696.06 0 5728.40 374.88 702.91 0
208-
# Out of bounds values : 0 OK
209-
# Avg bus bandwidth : 137.179
208+
Out of bounds values : 0 OK
209+
Avg bus bandwidth : 137.179
210210
```
211211

212212
</div>
