
Commit e4a849b

[Docs] Merge GCP clusters examples
Added distributed tasks examples
1 parent 81594ea commit e4a849b

1 file changed: examples/clusters/gcp/README.md (67 additions, 9 deletions)
````diff
@@ -307,7 +307,9 @@ Once you've configured the `gcp` backend, create the fleet configuration:
 
 Once the fleet is created, you can run distributed tasks, in addition to dev environments, services, and regular tasks.
 
-## Run NCCL tests
+## Run tasks
+
+### NCCL tests
 
 Use a distributed task that runs NCCL tests to validate cluster network bandwidth.
 
````

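For reference, a dstack distributed NCCL-tests task like the ones linked in this diff generally looks as follows. This is a minimal, hedged sketch: the task name, node count, resources, and the exact benchmark invocation are illustrative, not copied from the example files.

```yaml
# Minimal sketch of a distributed NCCL-tests task (illustrative only;
# see the example files linked in the diff for the real configurations).
type: task
name: nccl-tests   # hypothetical name
nodes: 2           # number of cluster nodes to span

commands:
  - |
    # Run the all_reduce benchmark across all GPUs via MPI.
    # DSTACK_GPUS_NUM and DSTACK_GPUS_PER_NODE are provided by dstack.
    mpirun --allow-run-as-root \
      -n $DSTACK_GPUS_NUM -N $DSTACK_GPUS_PER_NODE \
      all_reduce_perf -b 8 -e 8G -f 2 -g 1

resources:
  gpu: H100:8      # illustrative; match your fleet's instances
  shm_size: 16GB
```

With `nodes` set, dstack provisions the task across the fleet's cluster nodes, so the benchmark measures the inter-node fabric rather than a single machine.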
````diff
@@ -343,11 +345,9 @@ Use a distributed task that runs NCCL tests to validate cluster network bandwidt
 
     </div>
 
-    !!! info "Source code"
-        The source code of the task can be found at [examples/clusters/nccl-tests/.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-tests/.dstack.yml).
-
 === "A3 Mega"
-    > To fully use GPUDirect-TCPXO, properly set the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl).
+    !!! info "Source code"
+        The source code of the task can be found at [examples/clusters/gcp/a3mega-nccl-tests.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/gcp/a3mega-nccl-tests.dstack.yml).
 
     Pass the configuration to `dstack apply`:
 
````

````diff
@@ -379,11 +379,9 @@ Use a distributed task that runs NCCL tests to validate cluster network bandwidt
 
     </div>
 
-    !!! info "Source code"
-        The source code of the task can be found at [examples/clusters/gcp/a3mega-nccl-tests.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/gcp/a3mega-nccl-tests.dstack.yml).
-
 === "A3 High/Edge"
-    > To fully use GPUDirect-TCPX, properly set the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl). Since we use a ready-to-use Docker image, these environment variables are already preconfigured.
+    !!! info "Source code"
+        The source code of the task can be found at [examples/clusters/nccl-tests/.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-tests/.dstack.yml).
 
     Pass the configuration to `dstack apply`:
 
````

````diff
@@ -418,6 +416,66 @@ Use a distributed task that runs NCCL tests to validate cluster network bandwidt
     !!! info "Source code"
         The source code of the task can be found at [examples/clusters/gcp/a3high-nccl-tests.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/gcp/a3high-nccl-tests.dstack.yml).
 
+### Distributed training
+
+=== "A4"
+    You can use the standard [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) example to run distributed training on A4 instances.
+
+=== "A3 Mega"
+    You can use the standard [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) example to run distributed training on A3 Mega instances. To enable GPUDirect-TCPXO, make sure the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl) are properly set, for example by adding the following commands at the beginning:
+
+    ```yaml
+    # ...
+
+    commands:
+      - |
+        NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
+        source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh
+        export NCCL_FASTRAK_CTRL_DEV=enp0s12
+        export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
+        export NCCL_SOCKET_IFNAME=enp0s12
+        export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
+        export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
+
+        # ...
+    ```
+
+=== "A3 High/Edge"
+    You can use the standard [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) example to run distributed training on A3 High/Edge instances. To enable GPUDirect-TCPX, make sure the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl) are properly set, for example by adding the following commands at the beginning:
+
+    ```yaml
+    # ...
+
+    commands:
+      - |
+        export NCCL_DEBUG=INFO
+        NCCL_LIB_DIR="/usr/local/tcpx/lib64"
+        export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
+        export NCCL_SOCKET_IFNAME=eth0
+        export NCCL_CROSS_NIC=0
+        export NCCL_ALGO=Ring
+        export NCCL_PROTO=Simple
+        export NCCL_NSOCKS_PERTHREAD=4
+        export NCCL_SOCKET_NTHREADS=1
+        export NCCL_NET_GDR_LEVEL=PIX
+        export NCCL_P2P_PXN_LEVEL=0
+        export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
+        export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
+        export NCCL_DYNAMIC_CHUNK_SIZE=524288
+        export NCCL_P2P_NET_CHUNKSIZE=524288
+        export NCCL_P2P_PCI_CHUNKSIZE=524288
+        export NCCL_P2P_NVL_CHUNKSIZE=1048576
+        export NCCL_BUFFSIZE=4194304
+        export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
+        export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
+        export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
+        export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
+
+        # ...
+    ```
+
+In addition to distributed training, you can also run regular tasks, dev environments, and services.
+
 ## What's next
 
 1. Learn about [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), [services](https://dstack.ai/docs/concepts/services)
````
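As a closing illustration, the A3 Mega prologue added in this diff can be combined with a regular distributed task into a single configuration. This is a hedged sketch: the task name, resources, and `train.py` are hypothetical, and only the NCCL prologue comes from the diff above.

```yaml
# Hypothetical distributed training task for A3 Mega (illustrative only).
type: task
name: train-distrib   # hypothetical name
nodes: 2

commands:
  - |
    # GPUDirect-TCPXO environment, as shown in the diff above
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_SOCKET_IFNAME=enp0s12
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"

    # Launch training in the same shell so the exports above apply.
    # The DSTACK_* variables are provided by dstack; train.py is hypothetical.
    torchrun \
      --nproc-per-node $DSTACK_GPUS_PER_NODE \
      --nnodes $DSTACK_NODES_NUM \
      --node-rank $DSTACK_NODE_RANK \
      --master-addr $DSTACK_MASTER_NODE_IP \
      --master-port 12345 \
      train.py

resources:
  gpu: H100:8   # illustrative; match your fleet's instances
  shm_size: 16GB
```

Keeping the NCCL exports and the launcher in one `commands` block matters: the environment variables must be set in the same shell that starts the training process.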
