Commit 16bcfd3

[Docs] Add Crusoe example under Clusters (#3381)

1 parent 56a9824 commit 16bcfd3

5 files changed: +309 −5 lines

docs/docs/guides/kubernetes.md

Lines changed: 5 additions & 5 deletions

````diff
@@ -18,11 +18,11 @@ projects:
 - name: main
   backends:
   - type: kubernetes
-  kubeconfig:
-    filename: ~/.kube/config
-  proxy_jump:
-    hostname: 204.12.171.137
-    port: 32000
+    kubeconfig:
+      filename: ~/.kube/config
+    proxy_jump:
+      hostname: 204.12.171.137
+      port: 32000
 ```

 </div>
````

docs/examples.md

Lines changed: 10 additions & 0 deletions

```diff
@@ -140,6 +140,16 @@ hide:
     Set up AWS EFA clusters with optimized networking
   </p>
 </a>
+<a href="/examples/clusters/crusoe"
+   class="feature-cell sky">
+  <h3>
+    Crusoe
+  </h3>
+
+  <p>
+    Set up Crusoe clusters with optimized networking
+  </p>
+</a>
 </div>

 ## Inference
```

docs/examples/clusters/crusoe/index.md

Whitespace-only changes.

examples/clusters/crusoe/README.md

Lines changed: 293 additions & 0 deletions

# Crusoe

Crusoe offers two ways to use clusters with fast interconnect:

* [Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for the NVIDIA GPU Operator and related tools.
* [Virtual Machines (VMs)](#vms) – Gives you direct access to cluster nodes as virtual machines.

Both options use the same underlying networking infrastructure. This example walks you through setting up Crusoe clusters for use with `dstack`.

## Kubernetes

!!! info "Prerequisites"
    1. Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. The `dstack` server uses this port to reach the jump host.
    2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
    3. Go to the cluster and click `Create Node Pool`. Select the appropriate instance type. If you intend to auto-scale the cluster, set `Desired Number of Nodes` to at least `1`, since `dstack` doesn't currently support clusters that scale down to `0` nodes.
    4. Wait until at least one node is running.
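The firewall port matters because the jump host is typically published as a Kubernetes NodePort service, and Kubernetes only allocates NodePorts from a fixed range, `30000`–`32767` by default. A quick sanity check for a candidate port (a sketch; the function name is ours, not `dstack`'s):

```python
def in_default_nodeport_range(port: int) -> bool:
    """Kubernetes allocates NodePort services from a fixed range,
    30000-32767 by default (--service-node-port-range)."""
    return 30000 <= port <= 32767

print(in_default_nodeport_range(30022))  # True
```

Both `30022` here and `32000` in the dstack docs example fall inside that default range.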
### Configure the backend

Follow the standard instructions for setting up a [Kubernetes](https://dstack.ai/docs/concepts/backends/#kubernetes) backend:

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      filename: <kubeconfig path>
    proxy_jump:
      port: 30022
```

</div>

### Create a fleet

Once the Kubernetes cluster and the `dstack` server are running, you can create a fleet:

<div editor-title="crusoe-fleet.dstack.yml">

```yaml
type: fleet
name: crusoe-fleet

placement: cluster
nodes: 0..

backends: [kubernetes]

resources:
  # Specify requirements to filter nodes
  gpu: 1..8
```

</div>
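The `nodes: 0..` and `gpu: 1..8` values use range syntax: `min..max`, where an omitted side means unbounded. A minimal sketch of these semantics (illustrative only, not `dstack`'s actual parser):

```python
def parse_range(spec: str):
    """Parse a 'min..max' range; an omitted side means unbounded.

    Illustrative only -- not dstack's real implementation.
    """
    if ".." in spec:
        lo, hi = spec.split("..")
        return (int(lo) if lo else None, int(hi) if hi else None)
    return (int(spec), int(spec))  # a bare number means exactly that value

print(parse_range("0.."))   # (0, None): any number of nodes, including zero
print(parse_range("1..8"))  # (1, 8): between one and eight GPUs per node
```

So the fleet above starts empty and grows on demand, matching only nodes that expose one to eight GPUs.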
Pass the fleet configuration to `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```

</div>

Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).

## VMs

Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). To use a VM-based Crusoe cluster with `dstack`, use [SSH fleets](https://dstack.ai/docs/concepts/fleets) instead.

!!! info "Prerequisites"
    1. Go to `Compute`, then `Instances`, and click `Create Instance`. Select an instance type and VM image that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html), and create as many instances as you need.

### Create a fleet

Follow the standard instructions for setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets):

<div editor-title="crusoe-fleet.dstack.yml">

```yaml
type: fleet
name: crusoe-fleet

placement: cluster

# SSH credentials for the Crusoe VMs
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
  - 3.255.177.51
  - 3.255.177.52
```

</div>
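A common reason SSH fleet provisioning fails to connect is the `identity_file` permissions: SSH refuses private keys that are readable by group or others ("UNPROTECTED PRIVATE KEY FILE"). A small sanity check (a sketch; the path is whatever you set in `identity_file`):

```python
import os
import stat


def identity_file_ok(path: str) -> bool:
    """True if the private key is accessible only by its owner (e.g. 0600),
    which is what SSH requires before it will use the key."""
    mode = os.stat(os.path.expanduser(path)).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

If the check fails, `chmod 600 ~/.ssh/id_rsa` fixes it.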
Pass the fleet configuration to `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```

</div>

Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).

## Run NCCL tests

Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-task) that runs NCCL tests to validate cluster network bandwidth.

=== "Kubernetes"

    If you're running on Crusoe's Kubernetes, make sure to install HPC-X and provide an up-to-date topology file.

    <div editor-title="crusoe-nccl-tests.dstack.yml">

    ```yaml
    type: task
    name: nccl-tests

    nodes: 2
    startup_order: workers-first
    stop_criteria: master-done

    commands:
      # Install NCCL topology files
      - curl -sSL https://gist.github.com/un-def/48df8eea222fa9547ad4441986eb15af/archive/df51d56285c5396a0e82bb42f4f970e7bb0a9b65.tar.gz -o nccl_topo.tar.gz
      - mkdir -p /etc/crusoe/nccl_topo
      - tar -C /etc/crusoe/nccl_topo -xf nccl_topo.tar.gz --strip-components=1
      # Install and initialize HPC-X
      - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
      - mkdir -p /opt/hpcx
      - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
      - . /opt/hpcx/hpcx-init.sh
      - hpcx_load
      # Run NCCL tests
      - |
        if [ $DSTACK_NODE_RANK -eq 0 ]; then
          mpirun \
            --allow-run-as-root \
            --hostfile $DSTACK_MPI_HOSTFILE \
            -n $DSTACK_GPUS_NUM \
            -N $DSTACK_GPUS_PER_NODE \
            --bind-to none \
            -mca btl tcp,self \
            -mca coll_hcoll_enable 0 \
            -x PATH \
            -x LD_LIBRARY_PATH \
            -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
            -x NCCL_SOCKET_NTHREADS=4 \
            -x NCCL_NSOCKS_PERTHREAD=8 \
            -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
            -x NCCL_IB_MERGE_VFS=0 \
            -x NCCL_IB_AR_THRESHOLD=0 \
            -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
            -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
            -x NCCL_IB_QPS_PER_CONNECTION=2 \
            -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
            -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
            /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
        else
          sleep infinity
        fi

    # Required for IB
    privileged: true

    resources:
      gpu: A100:8
      shm_size: 16GB
    ```

    </div>

    > The task above downloads an A100 topology file from a Gist. The most reliable way to obtain the latest topology is to copy it from a Crusoe-provisioned VM (see [VMs](#vms)).

    ??? info "Privileged"
        When running on Kubernetes, set `privileged` to `true` to ensure access to InfiniBand.

=== "SSH fleets"

    With Crusoe VMs, HPC-X and up-to-date topology files are already available on the hosts. When using SSH fleets, simply mount them via [instance volumes](https://dstack.ai/docs/concepts/volumes#instance-volumes).

    ```yaml
    type: task
    name: nccl-tests

    nodes: 2
    startup_order: workers-first
    stop_criteria: master-done

    volumes:
      - /opt/hpcx:/opt/hpcx
      - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo

    commands:
      - . /opt/hpcx/hpcx-init.sh
      - hpcx_load
      # Run NCCL tests
      - |
        if [ $DSTACK_NODE_RANK -eq 0 ]; then
          mpirun \
            --allow-run-as-root \
            --hostfile $DSTACK_MPI_HOSTFILE \
            -n $DSTACK_GPUS_NUM \
            -N $DSTACK_GPUS_PER_NODE \
            --bind-to none \
            -mca btl tcp,self \
            -mca coll_hcoll_enable 0 \
            -x PATH \
            -x LD_LIBRARY_PATH \
            -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
            -x NCCL_SOCKET_NTHREADS=4 \
            -x NCCL_NSOCKS_PERTHREAD=8 \
            -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
            -x NCCL_IB_MERGE_VFS=0 \
            -x NCCL_IB_AR_THRESHOLD=0 \
            -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
            -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
            -x NCCL_IB_QPS_PER_CONNECTION=2 \
            -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
            -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
            /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
        else
          sleep infinity
        fi

    resources:
      gpu: A100:8
      shm_size: 16GB
    ```

Pass the configuration to `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f crusoe-nccl-tests.dstack.yml

Provisioning...
---> 100%

nccl-tests provisioning completed (running)

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    27.70    0.00    0.00       0    29.82    0.00    0.00       0
          16             4     float     sum      -1    28.78    0.00    0.00       0    28.99    0.00    0.00       0
          32             8     float     sum      -1    28.49    0.00    0.00       0    28.16    0.00    0.00       0
          64            16     float     sum      -1    28.41    0.00    0.00       0    28.69    0.00    0.00       0
         128            32     float     sum      -1    28.94    0.00    0.01       0    28.58    0.00    0.01       0
         256            64     float     sum      -1    29.46    0.01    0.02       0    29.45    0.01    0.02       0
         512           128     float     sum      -1    30.23    0.02    0.03       0    29.85    0.02    0.03       0
        1024           256     float     sum      -1    30.79    0.03    0.06       0    34.03    0.03    0.06       0
        2048           512     float     sum      -1    37.90    0.05    0.10       0    33.22    0.06    0.12       0
        4096          1024     float     sum      -1    35.91    0.11    0.21       0    35.30    0.12    0.22       0
        8192          2048     float     sum      -1    36.84    0.22    0.42       0    38.30    0.21    0.40       0
       16384          4096     float     sum      -1    47.08    0.35    0.65       0    37.26    0.44    0.82       0
       32768          8192     float     sum      -1    45.20    0.72    1.36       0    48.70    0.67    1.26       0
       65536         16384     float     sum      -1    49.43    1.33    2.49       0    50.97    1.29    2.41       0
      131072         32768     float     sum      -1    51.08    2.57    4.81       0    50.17    2.61    4.90       0
      262144         65536     float     sum      -1   192.78    1.36    2.55       0   100.00    2.62    4.92       0
      524288        131072     float     sum      -1    68.02    7.71   14.45       0    69.40    7.55   14.16       0
     1048576        262144     float     sum      -1    81.71   12.83   24.06       0    88.58   11.84   22.20       0
     2097152        524288     float     sum      -1   113.03   18.55   34.79       0   102.21   20.52   38.47       0
     4194304       1048576     float     sum      -1   123.50   33.96   63.68       0   131.71   31.84   59.71       0
     8388608       2097152     float     sum      -1   189.42   44.29   83.04       0   183.01   45.84   85.95       0
    16777216       4194304     float     sum      -1   274.05   61.22  114.79       0   265.91   63.09  118.30       0
    33554432       8388608     float     sum      -1   490.77   68.37  128.20       0   490.53   68.40  128.26       0
    67108864      16777216     float     sum      -1   854.62   78.52  147.23       0   853.49   78.63  147.43       0
   134217728      33554432     float     sum      -1  1483.43   90.48  169.65       0  1479.22   90.74  170.13       0
   268435456      67108864     float     sum      -1  2700.36   99.41  186.39       0  2700.49   99.40  186.38       0
   536870912     134217728     float     sum      -1  5300.49  101.29  189.91       0  5314.91  101.01  189.40       0
  1073741824     268435456     float     sum      -1  10472.2  102.53  192.25       0  10485.6  102.40  192.00       0
  2147483648     536870912     float     sum      -1  20749.1  103.50  194.06       0  20745.7  103.51  194.09       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 53.7387
```

</div>
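The output above can be sanity-checked: `-b 8 -e 2G -f 2` sweeps message sizes by doubling from 8 B to 2 GiB, and for all-reduce the reported bus bandwidth relates to algorithm bandwidth by the standard `2*(n-1)/n` factor for `n` ranks (here 2 nodes × 8 GPUs = 16). A sketch (our helper names, following the nccl-tests performance notes):

```python
def sweep_sizes(begin: int, end: int, factor: int):
    """Message sizes tested by all_reduce_perf with -b/-e/-f."""
    size = begin
    while size <= end:
        yield size
        size *= factor


def allreduce_busbw(algbw: float, ranks: int) -> float:
    """Bus bandwidth from algorithm bandwidth for all-reduce."""
    return algbw * 2 * (ranks - 1) / ranks


sizes = list(sweep_sizes(8, 2 * 1024**3, 2))
print(len(sizes), sizes[0], sizes[-1])        # 29 8 2147483648
print(round(allreduce_busbw(103.50, 16), 2))  # 194.06 -- matches the last data row
```

The 29 sizes match the 29 data rows, and the `busbw` column is consistently `algbw × 1.875`, confirming 16 ranks took part.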
## What's next

1. Learn about [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services)
2. Read the [Kubernetes](https://dstack.ai/docs/guides/kubernetes) and [Clusters](https://dstack.ai/docs/guides/clusters) guides
3. Check Crusoe's docs on [networking](https://docs.crusoecloud.com/networking/infiniband/) and [Kubernetes](https://docs.crusoecloud.com/orchestration/cmk/index.html)

mkdocs.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -326,6 +326,7 @@ nav:
   - GCP A3 Mega: examples/clusters/a3mega/index.md
   - GCP A3 High: examples/clusters/a3high/index.md
   - AWS EFA: examples/clusters/efa/index.md
+  - Crusoe: examples/clusters/crusoe/index.md
 - Inference:
   - SGLang: examples/inference/sglang/index.md
   - vLLM: examples/inference/vllm/index.md
```
