diff --git a/docs/blog/posts/efa.md b/docs/blog/archive/efa.md
similarity index 100%
rename from docs/blog/posts/efa.md
rename to docs/blog/archive/efa.md
diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md
index e02db2b127..76e99a24c7 100644
--- a/docs/docs/concepts/fleets.md
+++ b/docs/docs/concepts/fleets.md
@@ -70,7 +70,7 @@
     This ensures all instances are provisioned with optimal inter-node connectivity.
     Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration. Otherwise, instances are only connected by the default VPC subnet.
-    Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+    Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.
 
 ??? info "GCP"
     When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
diff --git a/docs/docs/guides/clusters.md b/docs/docs/guides/clusters.md
index cf4b1b4171..ce81a69fc5 100644
--- a/docs/docs/guides/clusters.md
+++ b/docs/docs/guides/clusters.md
@@ -22,7 +22,7 @@ For cloud fleets, fast interconnect is currently supported only on the `aws`, `g
 !!! info "Backend configuration"
     Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
-    Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+    Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.
 
 === "GCP"
     When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
diff --git a/docs/examples.md b/docs/examples.md
index 17df886a9f..cb2bd9e558 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -103,7 +103,7 @@ hide:

- A3 Mega
+ GCP A3 Mega

@@ -113,13 +113,23 @@ hide:

- A3 High
+ GCP A3 High

Set up GCP A3 High clusters with optimized networking

+
+ AWS EFA
+
+ Set up AWS EFA clusters with optimized networking
+
## Inference
diff --git a/docs/examples/clusters/efa/index.md b/docs/examples/clusters/efa/index.md
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/examples/clusters/efa/README.md b/examples/clusters/efa/README.md
new file mode 100644
index 0000000000..2e8de135ff
--- /dev/null
+++ b/examples/clusters/efa/README.md
@@ -0,0 +1,198 @@
+# AWS EFA
+
+In this guide, we’ll walk through how to run high-performance distributed training on AWS using [Amazon Elastic Fabric Adapter (EFA) :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"} with `dstack`.
+
+## Overview
+
+EFA is a network interface for Amazon EC2 that enables low-latency, high-bandwidth inter-node communication — essential for scaling distributed deep learning. With `dstack`, EFA is automatically enabled when you create fleets with supported instance types.
+
+## Prerequisite
+
+Before you start, make sure the `aws` backend is properly configured.
+
+
+```yaml
+projects:
+- name: main
+  backends:
+  - type: aws
+    creds:
+      type: default
+    regions: ["us-west-2"]
+
+    public_ips: false
+    vpc_name: my-custom-vpc
+```
+
+
+!!! info "Multiple network interfaces"
+    To use P4, P5, or P6 instances, set `public_ips` to `false` — this allows AWS to attach multiple network interfaces for EFA. In this case, make sure the `dstack` server can reach your VPC’s private subnets.
+
+!!! info "VPC"
+    If you use a custom VPC, verify that it permits all internal traffic between nodes for EFA to function properly.
+
+## Create a fleet
+
+Once your backend is ready, define a fleet configuration.
+
+
+```yaml
+type: fleet
+name: my-efa-fleet
+
+nodes: 2
+placement: cluster
+
+resources:
+  gpu: H100:8
+```
+
+
+Provision the fleet with `dstack apply`:
+
+
+```shell
+$ dstack apply -f examples/clusters/efa/fleet.dstack.yml
+
+Provisioning...
+---> 100%
+
+ FLEET         INSTANCE  BACKEND          INSTANCE TYPE  GPU          PRICE   STATUS  CREATED
+ my-efa-fleet  0         aws (us-west-2)  p5.48xlarge    H100:8:80GB  $98.32  idle    3 mins ago
+               1         aws (us-west-2)  p5.48xlarge    H100:8:80GB  $98.32  idle    3 mins ago
+```
+
+
+??? info "Instance types"
+    `dstack` selects suitable instances automatically, but not
+    [all types support EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}.
+    To enforce EFA, you can specify `instance_types` explicitly:
+
+    ```yaml
+    type: fleet
+    name: my-efa-fleet
+
+    nodes: 2
+    placement: cluster
+
+    resources:
+      gpu: L4
+
+    instance_types: ["g6.8xlarge"] # If not specified, g6.xlarge is used (won't have EFA)
+    ```
+
+## Run NCCL tests
+
+To confirm that EFA is working, run NCCL tests:
+
+
+```yaml
+type: task
+name: nccl-tests
+
+nodes: 2
+
+startup_order: workers-first
+stop_criteria: master-done
+
+env:
+  - NCCL_DEBUG=INFO
+commands:
+  - |
+    if [ $DSTACK_NODE_RANK -eq 0 ]; then
+      mpirun \
+        --allow-run-as-root \
+        --hostfile $DSTACK_MPI_HOSTFILE \
+        -n $DSTACK_GPUS_NUM \
+        -N $DSTACK_GPUS_PER_NODE \
+        --bind-to none \
+        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
+    else
+      sleep infinity
+    fi
+
+resources:
+  gpu: 1..8
+  shm_size: 16GB
+```
+
+
+Run it with `dstack apply`:
+
+
+```shell
+$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml
+
+Provisioning...
+---> 100%
+```
+
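With `NCCL_DEBUG=INFO` set as in the task above, the aws-ofi-nccl plugin prints which libfabric provider it selected — the easiest way to confirm traffic actually goes over EFA rather than plain TCP. The helper below is a minimal sketch: the exact log wording can vary between plugin versions, and the `nccl-tests` run name in the usage note is the one from the task above.

```shell
# check_efa: scan NCCL debug output (stdin) for the provider line that
# the aws-ofi-nccl plugin emits when NCCL_DEBUG=INFO is enabled.
check_efa() {
  if grep -qi "Selected Provider is efa"; then
    echo "EFA is in use"
  else
    echo "EFA provider not found; NCCL may be falling back to TCP"
  fi
}
```

For example: `dstack logs nccl-tests | check_efa`.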
+
+!!! info "Docker image"
+    You can use your own container by setting `image`. If omitted, `dstack` uses its default image with drivers, NCCL tests, and tools pre-installed.
+
+## Run distributed training
+
+Here’s an example using `torchrun` for a simple multi-node PyTorch job:
+
+
+```yaml
+type: task
+name: train-distrib
+
+nodes: 2
+
+python: 3.12
+env:
+  - NCCL_DEBUG=INFO
+commands:
+  - git clone https://github.com/pytorch/examples.git pytorch-examples
+  - cd pytorch-examples/distributed/ddp-tutorial-series
+  - uv pip install -r requirements.txt
+  - |
+    torchrun \
+      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
+      --node-rank=$DSTACK_NODE_RANK \
+      --nnodes=$DSTACK_NODES_NUM \
+      --master-addr=$DSTACK_MASTER_NODE_IP \
+      --master-port=12345 \
+      multinode.py 50 10
+
+resources:
+  gpu: 1..8
+  shm_size: 16GB
+```
+
+
+Provision and launch it via `dstack apply`:
+
+
+```shell
+$ dstack apply -f examples/distributed-training/torchrun/.dstack.yml
+
+Provisioning...
+---> 100%
+```
+
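The same task can run in a custom container rather than with `python`. The sketch below uses a hypothetical image name (`my-registry/pytorch-efa:latest`) and assumes the training script is baked into the image; whatever image you use must bundle the EFA userspace libraries (libfabric) and the aws-ofi-nccl plugin so NCCL can pick the EFA provider:

```yaml
type: task
name: train-distrib

nodes: 2

# Hypothetical image name; it must include libfabric with EFA support
# and the aws-ofi-nccl plugin in addition to PyTorch.
image: my-registry/pytorch-efa:latest

env:
  - NCCL_DEBUG=INFO
commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      multinode.py 50 10

resources:
  gpu: 1..8
  shm_size: 16GB
```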
+
+Instead of setting `python`, you can specify your own Docker image using `image`. Make sure that the image is properly configured for EFA.
+
+!!! info "What's next"
+    1. Learn more about [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks)
+    2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments),
+       [services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)
+    3. Read the [Clusters](https://dstack.ai/docs/guides/clusters) guide
diff --git a/examples/clusters/nccl-tests/README.md b/examples/clusters/nccl-tests/README.md
index ee1bf7fc9f..7e5a88d64f 100644
--- a/examples/clusters/nccl-tests/README.md
+++ b/examples/clusters/nccl-tests/README.md
@@ -13,14 +13,13 @@
 type: task
 name: nccl-tests
 
 nodes: 2
+
 startup_order: workers-first
 stop_criteria: master-done
-image: dstackai/efa
 
 env:
   - NCCL_DEBUG=INFO
 commands:
-  - cd /root/nccl-tests/build
   - |
     if [ $DSTACK_NODE_RANK -eq 0 ]; then
       mpirun \
@@ -28,15 +27,14 @@ commands:
         --hostfile $DSTACK_MPI_HOSTFILE \
         -n $DSTACK_GPUS_NUM \
         -N $DSTACK_GPUS_PER_NODE \
-        --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
-        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
+        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
     else
       sleep infinity
     fi
 
 resources:
-  gpu: nvidia:4:16GB
+  gpu: nvidia:1..8
   shm_size: 16GB
 ```
diff --git a/mkdocs.yml b/mkdocs.yml
index 5bb5718c67..2f2b7b4d06 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -105,7 +105,7 @@ plugins:
     'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-metrics.md'
     'blog/inactive-dev-environments-auto-shutdown.md': 'blog/posts/inactivity-duration.md'
     'blog/data-centers-and-private-clouds.md': 'blog/posts/gpu-blocks-and-proxy-jump.md'
-    'blog/distributed-training-with-aws-efa.md': 'blog/posts/efa.md'
+    'blog/distributed-training-with-aws-efa.md': 'examples/clusters/efa/index.md'
     'blog/dstack-stats.md': 'blog/posts/dstack-metrics.md'
     'docs/concepts/metrics.md': 'docs/guides/metrics.md'
     'docs/guides/monitoring.md': 'docs/guides/metrics.md'
@@ -122,6 +122,7 @@ plugins:
     'examples/deployment/trtllm/index.md': 'examples/inference/trtllm/index.md'
     'examples/fine-tuning/trl/index.md': 'examples/single-node-training/trl/index.md'
     'examples/fine-tuning/axolotl/index.md': 'examples/single-node-training/axolotl/index.md'
+    'blog/efa.md': 'examples/clusters/efa/index.md'
 - typeset
 - gen-files:
     scripts: # always relative to mkdocs.yml
@@ -271,8 +272,9 @@ nav:
   - Clusters:
     - NCCL tests: examples/clusters/nccl-tests/index.md
     - RCCL tests: examples/clusters/rccl-tests/index.md
-    - A3 Mega: examples/clusters/a3mega/index.md
-    - A3 High: examples/clusters/a3high/index.md
+    - GCP A3 Mega: examples/clusters/a3mega/index.md
+    - GCP A3 High: examples/clusters/a3high/index.md
+    - AWS EFA: examples/clusters/efa/index.md
   - Inference:
     - SGLang: examples/inference/sglang/index.md
     - vLLM: examples/inference/vllm/index.md