---
title: Efficient distributed training with AWS EFA
date: 2025-02-20
description: "The latest release of dstack allows you to use AWS EFA for your distributed training tasks."
slug: distributed-training-with-aws-efa
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/distributed-training-with-aws-efa-v2.png?raw=true
categories:
  - Fleets
---

# Efficient distributed training with AWS EFA

[Amazon Elastic Fabric Adapter (EFA) :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"} is a high-performance network interface designed for AWS EC2 instances, enabling ultra-low latency and high-throughput communication between nodes. This makes it an ideal solution for scaling distributed training workloads across multiple GPUs and instances.

With the latest release of `dstack`, you can now leverage AWS EFA to supercharge your distributed training tasks.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/distributed-training-with-aws-efa-v2.png?raw=true" width="630"/>

<!-- more -->

## Why EFA?

AWS EFA delivers up to 400 Gbps of bandwidth, enabling lightning-fast GPU-to-GPU communication across nodes. By bypassing the kernel and providing direct network access, EFA minimizes latency and maximizes throughput. Its native integration with the `nccl` library ensures optimal performance for large-scale distributed training.

With EFA, you can scale your training tasks to thousands of nodes.
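
To get a feel for why that bandwidth matters, here is a back-of-the-envelope estimate of the time a single gradient all-reduce takes with the standard ring algorithm. The model size, GPU count, and the slower comparison link below are illustrative assumptions, not AWS figures:

```python
def allreduce_seconds(params: float, bytes_per_param: int,
                      n_gpus: int, link_gbps: float) -> float:
    """Approximate ring all-reduce time: each GPU sends and receives
    about 2 * (n - 1) / n * S bytes over its link, where S is the
    gradient size in bytes."""
    size_bytes = params * bytes_per_param
    volume = 2 * (n_gpus - 1) / n_gpus * size_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gbps -> bytes/s
    return volume / link_bytes_per_s

# Illustrative: 7B-parameter model, fp16 gradients, 16 GPUs
slow = allreduce_seconds(7e9, 2, 16, 100)  # 100 Gbps interconnect, ~2.1 s
fast = allreduce_seconds(7e9, 2, 16, 400)  # 400 Gbps with EFA, ~0.53 s
print(f"{slow:.2f}s vs {fast:.2f}s per all-reduce")
```

Since the communication volume is fixed, the step-time saving scales linearly with link bandwidth, which is why the interconnect often dominates at cluster scale.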

To use AWS EFA with `dstack`, follow these steps to run your distributed training tasks.

## Configure the backend

Before using EFA, ensure the `aws` backend is properly configured.

If you’re using P4 or P5 instances with multiple network interfaces, you’ll need to disable public IPs. Note that in this case the `dstack` server must have access to the VPC’s private subnet.

You’ll also need to specify an AMI that includes the GDRCopy drivers. For example, you can use the [AWS Deep Learning Base GPU AMI :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/){:target="_blank"}.

Here’s an example backend configuration:

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
- name: main
  backends:
  - type: aws
    creds:
      type: default
    regions: ["us-west-2"]
    public_ips: false
    vpc_name: my-vpc
    os_images:
      nvidia:
        name: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20241115
        owner: 898082745236
        user: ubuntu
```

</div>

## Create a fleet

Once the backend is configured, you can create a fleet for distributed training. Here’s an example fleet configuration:

<div editor-title="examples/misc/fleets/efa.dstack.yml">

```yaml
type: fleet
name: my-efa-fleet

# Specify the number of instances
nodes: 2
placement: cluster

resources:
  gpu: H100:8
```

</div>

To provision the fleet, use the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:

<div class="termy">

```shell
$ dstack apply -f examples/misc/fleets/efa.dstack.yml

Provisioning...
---> 100%

 FLEET         INSTANCE  BACKEND          GPU          PRICE   STATUS  CREATED
 my-efa-fleet  0         aws (us-west-2)  8xH100:80GB  $98.32  idle    3 mins ago
               1         aws (us-west-2)  8xH100:80GB  $98.32  idle    3 mins ago
```

</div>
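
For a quick sanity check on scale and cost, the two instances above come to sixteen GPUs. A trivial sketch of the arithmetic, using the per-instance price shown in the output:

```python
# Figures from the `dstack apply` output above
nodes = 2
gpus_per_node = 8          # H100:8 per instance
price_per_node = 98.32     # $/hour per instance

total_gpus = nodes * gpus_per_node   # world size for one process per GPU
hourly_cost = nodes * price_per_node

print(total_gpus, round(hourly_cost, 2))  # 16 196.64
```

Keeping the GPU count and hourly burn rate in view helps when deciding how long to keep an idle fleet around.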

## Submit the task

With the fleet provisioned, you can now submit your distributed training task. Here’s an example task configuration:

<div editor-title="examples/misc/efa/task.dstack.yml">

```yaml
type: task
name: efa-task

# The size of the cluster
nodes: 2

python: "3.12"

# Commands to run on each node
commands:
  - pip install -r requirements.txt
  - accelerate launch
    --num_processes $DSTACK_GPUS_NUM
    --num_machines $DSTACK_NODES_NUM
    --machine_rank $DSTACK_NODE_RANK
    --main_process_ip $DSTACK_MASTER_NODE_IP
    --main_process_port 29500
    task.py

env:
  - LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
  - FI_PROVIDER=efa
  - FI_EFA_USE_HUGE_PAGE=0
  - OMPI_MCA_pml=^cm,ucx
  - NCCL_TOPO_FILE=/opt/amazon/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml # Typically loaded automatically, might not be necessary
  - OPAL_PREFIX=/opt/amazon/openmpi
  - NCCL_SOCKET_IFNAME=^docker0,lo
  - FI_EFA_USE_DEVICE_RDMA=1
  - NCCL_DEBUG=INFO # Optional debugging for NCCL communication
  - NCCL_DEBUG_SUBSYS=TUNING
```

</div>
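
The `accelerate launch` arguments above are assembled from environment variables that `dstack` injects on every node. A minimal sketch of how they fit together (the IP address and GPU count below are hypothetical placeholders for illustration, not values `dstack` guarantees):

```python
import os

# Placeholder values standing in for what dstack injects at run time
os.environ.setdefault("DSTACK_NODES_NUM", "2")              # nodes in the cluster
os.environ.setdefault("DSTACK_NODE_RANK", "0")              # this node's rank
os.environ.setdefault("DSTACK_MASTER_NODE_IP", "10.0.0.4")  # hypothetical private IP

nodes = int(os.environ["DSTACK_NODES_NUM"])
rank = int(os.environ["DSTACK_NODE_RANK"])
master_ip = os.environ["DSTACK_MASTER_NODE_IP"]
gpus_per_node = 8  # from `resources: gpu: H100:8` in the fleet configuration

cmd = [
    "accelerate", "launch",
    # total processes = one per GPU across the whole cluster
    "--num_processes", str(nodes * gpus_per_node),
    "--num_machines", str(nodes),
    "--machine_rank", str(rank),
    "--main_process_ip", master_ip,
    "--main_process_port", "29500",
    "task.py",
]
print(" ".join(cmd))
```

Every node runs the same command; only `DSTACK_NODE_RANK` differs, which is how `accelerate` tells the master apart from the workers.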

Submit the task using the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:

<div class="termy">

```shell
$ dstack apply -f examples/misc/efa/task.dstack.yml -R
```

</div>

`dstack` will automatically run the container on each node of the cluster, passing the necessary environment variables. `nccl` will leverage the EFA drivers and the specified environment variables to enable high-performance communication via EFA.

> Have questions? You're welcome to join
> our [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"} or talk
> directly to [our team :material-arrow-top-right-thin:{ .external }](https://calendly.com/dstackai/discovery-call){:target="_blank"}.

!!! info "What's next?"
    1. Check [fleets](../../docs/concepts/fleets.md), [tasks](../../docs/concepts/tasks.md), and [volumes](../../docs/concepts/volumes.md)
    2. Also see [dev environments](../../docs/concepts/dev-environments.md) and [services](../../docs/concepts/services.md)
    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
