examples/clusters/gcp/README.md
Lines changed: 67 additions & 9 deletions
@@ -307,7 +307,9 @@ Once you've configured the `gcp` backend, create the fleet configuration:
 
 Once the fleet is created, you can run distributed tasks, in addition to dev environments, services, and regular tasks.
 
-## Run NCCL tests
+## Run tasks
+
+### NCCL tests
 
 Use a distributed task that runs NCCL tests to validate cluster network bandwidth.
 
@@ -343,11 +345,9 @@ Use a distributed task that runs NCCL tests to validate cluster network bandwidt
 
 </div>
 
-!!! info "Source code"
-The source code of the task can be found at [examples/clusters/nccl-tests/.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-tests/.dstack.yml).
-
 === "A3 Mega"
-> To fully use GPUDirect-TCPXO, properly set the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl).
+!!! info "Source code"
+The source code of the task can be found at [examples/clusters/gcp/a3mega-nccl-tests.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/gcp/a3mega-nccl-tests.dstack.yml).
 
 Pass the configuration to `dstack apply`:
 
@@ -379,11 +379,9 @@ Use a distributed task that runs NCCL tests to validate cluster network bandwidt
 
 </div>
 
-!!! info "Source code"
-The source code of the task can be found at [examples/clusters/gcp/a3mega-nccl-tests.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/gcp/a3mega-nccl-tests.dstack.yml).
-
 === "A3 High/Edge"
-> To fully use GPUDirect-TCPX, properly set the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl). Since we use a ready-to-use Docker image, these environment variables are already preconfigured.
+!!! info "Source code"
+The source code of the task can be found at [examples/clusters/nccl-tests/.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-tests/.dstack.yml).
 
 Pass the configuration to `dstack apply`:
 
@@ -418,6 +416,66 @@ Use a distributed task that runs NCCL tests to validate cluster network bandwidt
 !!! info "Source code"
 The source code of the task can be found at [examples/clusters/gcp/a3high-nccl-tests.dstack.yml](https://github.com/dstackai/dstack/blob/master/examples/clusters/gcp/a3high-nccl-tests.dstack.yml).
 
+### Distributed training
+
+=== "A4"
+You can use the standard [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) example to run distributed training on A4 instances.
+
+=== "A3 Mega"
+You can use the standard [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) example to run distributed training on A3 Mega instances. To enable GPUDirect-TCPXO, make sure the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl) are properly set, for example by adding the following commands at the beginning:
+You can use the standard [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) example to run distributed training on A3 High/Edge instances. To enable GPUDirect-TCPX, make sure the required [NCCL environment variables](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot#environment-variables-nccl) are properly set, for example by adding the following commands at the beginning:
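
Whichever commands set the variables, it can help to confirm that they actually reached the job before training starts. The following is a minimal sketch and not part of this change: the helper name `print_nccl_env` is hypothetical, and it only assumes a POSIX shell with `env`, `grep`, and `sort` available in the task image.

```shell
# Hypothetical helper (not from the README): list the NCCL_* variables
# that are exported in the current environment.
print_nccl_env() {
  # `env` prints exported variables; keep only names starting with NCCL_.
  nccl_vars="$(env | grep '^NCCL_' | sort)"
  if [ -z "$nccl_vars" ]; then
    echo "no NCCL_* variables are set"
  else
    echo "$nccl_vars"
  fi
}
```

Calling `print_nccl_env` right after the environment setup and before the training command would show in the run logs whether the GPUDirect-related settings were picked up.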