2 changes: 1 addition & 1 deletion docs/docs/concepts/fleets.md
@@ -70,7 +70,7 @@ This ensures all instances are provisioned with optimal inter-node connectivity.
Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
Otherwise, instances are only connected by the default VPC subnet.

Refer to the [EFA](../../blog/posts/efa.md) example for more details.
Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.

??? info "GCP"
When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
2 changes: 1 addition & 1 deletion docs/docs/guides/clusters.md
@@ -22,7 +22,7 @@ For cloud fleets, fast interconnect is currently supported only on the `aws`, `g

!!! info "Backend configuration"
Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
Refer to the [EFA](../../blog/posts/efa.md) example for more details.
Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.

=== "GCP"
When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
14 changes: 12 additions & 2 deletions docs/examples.md
@@ -103,7 +103,7 @@ hide:
<a href="/examples/clusters/a3mega"
class="feature-cell sky">
<h3>
A3 Mega
GCP A3 Mega
</h3>

<p>
@@ -113,13 +113,23 @@ hide:
<a href="/examples/clusters/a3high"
class="feature-cell sky">
<h3>
A3 High
GCP A3 High
</h3>

<p>
Set up GCP A3 High clusters with optimized networking
</p>
</a>
<a href="/examples/clusters/efa"
class="feature-cell sky">
<h3>
AWS EFA
</h3>

<p>
Set up AWS EFA clusters with optimized networking
</p>
</a>
</div>

## Inference
198 changes: 198 additions & 0 deletions examples/clusters/efa/README.md
@@ -0,0 +1,198 @@
# AWS EFA

In this guide, we’ll walk through how to run high-performance distributed training on AWS using [Amazon Elastic Fabric Adapter (EFA) :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"} with `dstack`.

## Overview

EFA is a network interface for Amazon EC2 that enables low-latency, high-bandwidth inter-node communication — essential for scaling distributed deep learning. With `dstack`, EFA is automatically enabled when you create fleets with supported instance types.

## Prerequisites

Before you start, make sure the `aws` backend is properly configured.

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
- name: main
backends:
- type: aws
creds:
type: default
regions: ["us-west-2"]

public_ips: false
vpc_name: my-custom-vpc
```

</div>

!!! info "Multiple network interfaces"
To use P4, P5, or P6 instances, set `public_ips` to `false` — this allows AWS to attach multiple network interfaces for EFA. In this case, make sure the `dstack` server can reach your VPC’s private subnets.

!!! info "VPC"
If you use a custom VPC, verify that it permits all internal traffic between nodes for EFA to function properly.
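
For reference, EFA requires the nodes' security group to allow all traffic to and from itself. Below is a minimal sketch using the AWS CLI, where the group ID `sg-0123456789abcdef0` is a placeholder for your cluster's security group:

```shell
# Allow all inbound traffic from instances in the same security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --source-group sg-0123456789abcdef0

# Allow all outbound traffic to instances in the same security group
aws ec2 authorize-security-group-egress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --source-group sg-0123456789abcdef0
```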

## Create a fleet

Once your backend is ready, define a fleet configuration.

<div editor-title="examples/clusters/efa/fleet.dstack.yml">

```yaml
type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

resources:
gpu: H100:8
```

</div>

Provision the fleet with `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f examples/clusters/efa/fleet.dstack.yml

Provisioning...
---> 100%

FLEET INSTANCE BACKEND INSTANCE TYPE GPU PRICE STATUS CREATED
 my-efa-fleet 0 aws (us-west-2) p5.48xlarge H100:8:80GB $98.32 idle 3 mins ago
 1 aws (us-west-2) p5.48xlarge H100:8:80GB $98.32 idle 3 mins ago
```

</div>

??? info "Instance types"
`dstack` selects suitable instances automatically, but not
[all types support EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}.
To enforce EFA, you can specify `instance_types` explicitly:

```yaml
type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

resources:
gpu: L4

instance_types: ["g6.8xlarge"] # If not set, the cheaper g6.xlarge may be picked, which doesn't support EFA
```
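
Once the fleet is provisioned, you can verify that EFA devices are actually attached. A quick sketch, assuming you have a shell on one of the instances (for example, via a [dev environment](https://dstack.ai/docs/concepts/dev-environments)) and that the libfabric utilities are installed in the image:

```shell
# List libfabric providers; the EFA provider should be reported
fi_info -p efa

# EFA interfaces are also exposed as RDMA devices
ls /sys/class/infiniband/
```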

## Run NCCL tests

To confirm that EFA is working, run NCCL tests:

<div editor-title="examples/clusters/nccl-tests/.dstack.yml">

```yaml
type: task
name: nccl-tests

nodes: 2

startup_order: workers-first
stop_criteria: master-done

env:
- NCCL_DEBUG=INFO
commands:
- |
if [ $DSTACK_NODE_RANK -eq 0 ]; then
mpirun \
--allow-run-as-root \
--hostfile $DSTACK_MPI_HOSTFILE \
-n $DSTACK_GPUS_NUM \
-N $DSTACK_GPUS_PER_NODE \
--bind-to none \
all_reduce_perf -b 8 -e 8G -f 2 -g 1
else
sleep infinity
fi

resources:
gpu: 1..8
shm_size: 16GB
```

</div>

Run it with `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml

Provisioning...
---> 100%
```

</div>
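
With `NCCL_DEBUG=INFO` set, NCCL logs which network transport it selected. To confirm the test actually ran over EFA, you can search the run's logs for the libfabric provider. A sketch, assuming the run name `nccl-tests` from the configuration above:

<div class="termy">

```shell
$ dstack logs nccl-tests | grep -i efa
```

</div>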

!!! info "Docker image"
You can use your own container by setting `image`. If omitted, `dstack` uses its default image with drivers, NCCL tests, and tools pre-installed.

## Run distributed training

Here’s an example using `torchrun` for a simple multi-node PyTorch job:

<div editor-title="examples/distributed-training/torchrun/.dstack.yml">

```yaml
type: task
name: train-distrib

nodes: 2

python: 3.12
env:
- NCCL_DEBUG=INFO
commands:
- git clone https://github.com/pytorch/examples.git pytorch-examples
- cd pytorch-examples/distributed/ddp-tutorial-series
- uv pip install -r requirements.txt
- |
torchrun \
--nproc-per-node=$DSTACK_GPUS_PER_NODE \
--node-rank=$DSTACK_NODE_RANK \
--nnodes=$DSTACK_NODES_NUM \
--master-addr=$DSTACK_MASTER_NODE_IP \
--master-port=12345 \
multinode.py 50 10

resources:
gpu: 1..8
shm_size: 16GB
```

</div>

Provision and launch it via `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/torchrun/.dstack.yml

Provisioning...
---> 100%
```

</div>

Instead of setting `python`, you can specify your own Docker image using `image`. Make sure that the image is properly configured for EFA.
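
If you build such an image yourself, the user-space EFA stack comes from the AWS EFA installer. A rough sketch of the relevant build steps, with illustrative flags rather than a complete recipe:

```shell
# Install the user-space EFA libraries inside the image;
# the kernel module comes from the host, hence --skip-kmod
curl -sO https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify
```

For NCCL workloads, the image also needs the [aws-ofi-nccl :material-arrow-top-right-thin:{ .external }](https://github.com/aws/aws-ofi-nccl){:target="_blank"} plugin so that NCCL can communicate over libfabric.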

!!! info "What's next"
1. Learn more about [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks)
2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments),
[services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)
3. Read the [Clusters](https://dstack.ai/docs/guides/clusters) guide
8 changes: 3 additions & 5 deletions examples/clusters/nccl-tests/README.md
@@ -13,30 +13,28 @@ type: task
name: nccl-tests

nodes: 2

startup_order: workers-first
stop_criteria: master-done

image: dstackai/efa
env:
- NCCL_DEBUG=INFO
commands:
- cd /root/nccl-tests/build
- |
if [ $DSTACK_NODE_RANK -eq 0 ]; then
mpirun \
--allow-run-as-root \
--hostfile $DSTACK_MPI_HOSTFILE \
-n $DSTACK_GPUS_NUM \
-N $DSTACK_GPUS_PER_NODE \
--mca btl_tcp_if_exclude lo,docker0 \
--bind-to none \
./all_reduce_perf -b 8 -e 8G -f 2 -g 1
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
else
sleep infinity
fi

resources:
gpu: nvidia:4:16GB
gpu: nvidia:1..8
shm_size: 16GB
```

8 changes: 5 additions & 3 deletions mkdocs.yml
@@ -105,7 +105,7 @@ plugins:
'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-metrics.md'
'blog/inactive-dev-environments-auto-shutdown.md': 'blog/posts/inactivity-duration.md'
'blog/data-centers-and-private-clouds.md': 'blog/posts/gpu-blocks-and-proxy-jump.md'
'blog/distributed-training-with-aws-efa.md': 'blog/posts/efa.md'
'blog/distributed-training-with-aws-efa.md': 'examples/clusters/efa/index.md'
'blog/dstack-stats.md': 'blog/posts/dstack-metrics.md'
'docs/concepts/metrics.md': 'docs/guides/metrics.md'
'docs/guides/monitoring.md': 'docs/guides/metrics.md'
@@ -122,6 +122,7 @@ plugins:
'examples/deployment/trtllm/index.md': 'examples/inference/trtllm/index.md'
'examples/fine-tuning/trl/index.md': 'examples/single-node-training/trl/index.md'
'examples/fine-tuning/axolotl/index.md': 'examples/single-node-training/axolotl/index.md'
'blog/efa.md': 'examples/clusters/efa/index.md'
- typeset
- gen-files:
scripts: # always relative to mkdocs.yml
@@ -271,8 +272,9 @@
- Clusters:
- NCCL tests: examples/clusters/nccl-tests/index.md
- RCCL tests: examples/clusters/rccl-tests/index.md
- A3 Mega: examples/clusters/a3mega/index.md
- A3 High: examples/clusters/a3high/index.md
- GCP A3 Mega: examples/clusters/a3mega/index.md
- GCP A3 High: examples/clusters/a3high/index.md
- AWS EFA: examples/clusters/efa/index.md
- Inference:
- SGLang: examples/inference/sglang/index.md
- vLLM: examples/inference/vllm/index.md