Merged
55 commits
5290d98
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 12, 2025
53a717d
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 12, 2025
e412313
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 12, 2025
22b2530
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 12, 2025
2891ccb
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 12, 2025
5b0539f
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
a4e6d29
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
b102ddb
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
ee05d46
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
eac604a
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
0172ab7
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
8418ad7
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
b5710a0
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
f46afe8
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
59edf64
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
76ed3bc
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
0e36c85
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
6225135
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 13, 2025
dec318f
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
5a34a9c
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
562266f
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
2fd1bdf
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
dbb3545
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
bcfc856
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
fc5a8eb
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
1902cde
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
709b63c
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
312a9ee
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
2b58f67
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
8ae1be6
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
a24bd1a
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
2946a20
Revert "[UX] Pre-build a EFA version of the default Docker image #2793"
peterschmidt85 Jun 14, 2025
9903d01
Revert "[UX] Pre-build a EFA version of the default Docker image #2793"
peterschmidt85 Jun 14, 2025
8b796bb
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
305e5f1
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
46a7d51
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
0cbf5b9
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
0105f70
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 14, 2025
31dfd39
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 15, 2025
2358683
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 15, 2025
27520eb
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 15, 2025
de67511
Merge remote-tracking branch 'origin/master' into 2793-ux-pre-build-a…
peterschmidt85 Jun 15, 2025
bdaa059
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 16, 2025
035eced
[UX] Pre-build a EFA version of the default Docker image #2793
peterschmidt85 Jun 16, 2025
57ab13d
Experimental: Build default Docker images for both Ubuntu versions 20…
peterschmidt85 Jun 16, 2025
8acfff9
Experimental: Build default Docker images for both Ubuntu versions 20…
peterschmidt85 Jun 16, 2025
1b1c02b
Experimental: Build default Docker images for both Ubuntu versions 20…
peterschmidt85 Jun 16, 2025
78dc094
Experimental: Build default Docker images for both Ubuntu versions 20…
peterschmidt85 Jun 17, 2025
1367167
Added `OMPI_MCA_pml`, `OMPI_MCA_btl`, `OMPI_MCA_btl_tcp_if_exclude`, …
peterschmidt85 Jun 17, 2025
6d5ceb2
- [x] Added `OMPI_MCA_pml`, `OMPI_MCA_btl`, `OMPI_MCA_btl_tcp_if_exc…
peterschmidt85 Jun 17, 2025
3894ba9
- [x] Added OMPI_MCA_pml, OMPI_MCA_btl, OMPI_MCA_btl_tcp_if_exclude, …
peterschmidt85 Jun 17, 2025
a646ebc
Fixed broken tests
peterschmidt85 Jun 17, 2025
48b4b9d
Removed Ubuntu 20.04 from Ci/CD
peterschmidt85 Jun 17, 2025
d8d755c
Fixed broken tests
peterschmidt85 Jun 17, 2025
49b7cb9
Fixed tests
peterschmidt85 Jun 17, 2025
46 changes: 0 additions & 46 deletions .github/workflows/docker-efa.yml

This file was deleted.

20 changes: 17 additions & 3 deletions .github/workflows/docker.yml
@@ -51,8 +51,8 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python: ["3.9", "3.10", "3.11", "3.12", "3.13"]
flavor: ["base", "devel"]
flavor: ["base", "devel", "devel-efa"]
ubuntu_version: ["22"]
steps:
- name: Checkout repository
uses: actions/checkout@v4
@@ -67,7 +67,21 @@ jobs:
uses: docker/setup-qemu-action@v3
- name: Build and upload to DockerHub
run: |
docker buildx build --platform linux/amd64 --build-arg FLAVOR=${{ matrix.flavor }} --build-arg PYTHON=${{ matrix.python }} --push --provenance=false --tag dstackai/${{ env.BUILD_DOCKER_REPO }}:py${{ matrix.python }}-${{ inputs.image_version }}-cuda-12.1${{ matrix.flavor == 'devel' && '-devel' || '' }} -f base/Dockerfile .
if [ "${{ matrix.flavor }}" = "base" ]; then
FILE="base/Dockerfile"
elif [ "${{ matrix.flavor }}" = "devel" ]; then
FILE="base/Dockerfile"
else
FILE="base/efa/Dockerfile"
fi
docker buildx build \
--platform linux/amd64 \
--tag dstackai/${{ env.BUILD_DOCKER_REPO }}:${{ inputs.image_version }}-${{ matrix.flavor }}-ubuntu${{ matrix.ubuntu_version }}.04 \
--build-arg FLAVOR=${{ matrix.flavor }} \
--build-arg UBUNTU_VERSION=${{ matrix.ubuntu_version }} \
--provenance=false \
--push \
-f $FILE .

build-aws-images:
needs: build-docker
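The build step above picks a Dockerfile from the matrix flavor and composes the image tag from the image version, flavor, and Ubuntu release. A minimal shell sketch of the same mapping, written with a `case` statement instead of the `if`/`elif` chain. The function names `dockerfile_for` and `image_tag`, the repository name `base`, and the version `0.10` are illustrative assumptions and do not come from the workflow:

```shell
# Sketch only: mirrors the flavor-to-Dockerfile branching in the workflow step.
dockerfile_for() {
  case "$1" in
    base|devel) echo "base/Dockerfile" ;;      # both flavors share one Dockerfile
    *)          echo "base/efa/Dockerfile" ;;  # anything else (devel-efa) uses the EFA one
  esac
}

# Compose the tag in the same order as the --tag argument:
# <repo>:<image_version>-<flavor>-ubuntu<ubuntu_version>.04
image_tag() {
  echo "dstackai/${1}:${2}-${3}-ubuntu${4}.04"
}

dockerfile_for devel-efa
image_tag base 0.10 devel-efa 22
```

Because `base` and `devel` resolve to the same file, the two flavors differ only through the `FLAVOR` build argument.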
105 changes: 78 additions & 27 deletions docker/base/Dockerfile
@@ -1,28 +1,79 @@
# syntax = edrevo/dockerfile-plus
ARG UBUNTU_VERSION

# Build stage
FROM nvidia/cuda:12.1.1-base-ubuntu${UBUNTU_VERSION}.04 AS builder

ENV NCCL_HOME=/opt/nccl
ENV CUDA_HOME=/usr/local/cuda
ENV OPEN_MPI_PATH=/usr/lib/x86_64-linux-gnu/openmpi

# Prerequisites

RUN export DEBIAN_FRONTEND=noninteractive \
&& apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}04/x86_64/3bf863cc.pub \
&& apt-get update --fix-missing \
&& apt-get upgrade -y \
&& ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime \
&& apt-get install -y tzdata \
&& dpkg-reconfigure --frontend noninteractive tzdata \
&& cuda_version=$(echo ${CUDA_VERSION} | awk -F . '{ print $1"-"$2 }') \
&& apt-get install -y --no-install-recommends \
cuda-libraries-dev-${cuda_version} \
cuda-nvcc-${cuda_version} \
libhwloc-dev \
autoconf \
automake \
libtool \
libopenmpi-dev \
git \
curl \
python3 \
build-essential

# NCCL

ARG NCCL_VERSION=2.26.2-1

RUN cd /tmp \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION} \
&& cd nccl \
&& make -j$(nproc) src.build BUILDDIR=${NCCL_HOME}

# NCCL tests

RUN cd /opt \
&& git clone https://github.com/NVIDIA/nccl-tests \
&& cd nccl-tests \
&& make -j$(nproc) \
MPI=1 \
MPI_HOME=${OPEN_MPI_PATH} \
CUDA_HOME=${CUDA_HOME} \
NCCL_HOME=${NCCL_HOME}

# Final stage

INCLUDE+ base/Dockerfile.common

ENV NCCL_HOME=/opt/nccl

COPY --from=builder ${NCCL_HOME} ${NCCL_HOME}
COPY --from=builder /opt/nccl-tests/build /opt/nccl-tests/build

ARG FLAVOR
FROM nvidia/cuda:12.1.1-${FLAVOR}-ubuntu20.04

ARG PYTHON
ARG _UV_HOME="/opt/uv"
ENV UV_PYTHON="${PYTHON}"
ENV UV_INSTALL_DIR="${_UV_HOME}/bin"
ENV UV_PYTHON_INSTALL_DIR="${_UV_HOME}/python"
ENV UV_PYTHON_BIN_DIR="${UV_PYTHON_INSTALL_DIR}/bin"
ENV UV_MANAGED_PYTHON=1
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

ENV PATH="${UV_INSTALL_DIR}:${UV_PYTHON_BIN_DIR}:${PATH}"

RUN export DEBIAN_FRONTEND=noninteractive && \
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub && \
apt-get update --fix-missing && \
apt-get upgrade -y && \
ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime && \
apt-get install -y tzdata && \
dpkg-reconfigure --frontend noninteractive tzdata && \
apt-get install -y bzip2 ca-certificates curl build-essential git libglib2.0-0 libsm6 libxext6 libxrender1 mercurial openssh-server subversion wget \
libibverbs1 ibverbs-providers ibverbs-utils libibverbs-dev infiniband-diags && \
sed -i "s/.*PasswordAuthentication.*/PasswordAuthentication no/g" /etc/ssh/sshd_config && mkdir /run/sshd && \
mkdir ~/.ssh && chmod 700 ~/.ssh && touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys && rm /etc/ssh/ssh_host_*

RUN curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_NO_MODIFY_PATH=1 sh && \
uv python install --preview --default

# MPI, NVCC, and /etc/ld.so.conf.d

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
openmpi-bin \
&& if [ "$FLAVOR" = "devel" ]; then \
cuda_version=$(echo ${CUDA_VERSION} | awk -F . '{ print $1"-"$2 }') \
&& apt-get install -y --no-install-recommends \
cuda-libraries-dev-${cuda_version} \
cuda-nvcc-${cuda_version} \
libhwloc-dev; \
fi \
&& rm -rf /var/lib/apt/lists/* \
&& echo "${NCCL_HOME}/lib" >> /etc/ld.so.conf.d/nccl.conf \
&& ldconfig
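The RUN steps above derive the apt package suffix from `CUDA_VERSION` with `awk`. A small sketch of that conversion on its own, assuming `CUDA_VERSION` has the form `12.1.1` as the base image tag suggests. The helper name `cuda_pkg_suffix` is hypothetical:

```shell
# Sketch only: turn a dotted CUDA version into the "major-minor" form
# used in package names such as cuda-nvcc-12-1.
cuda_pkg_suffix() {
  echo "$1" | awk -F . '{ print $1"-"$2 }'
}

suffix=$(cuda_pkg_suffix "12.1.1")
echo "cuda-nvcc-${suffix}"
echo "cuda-libraries-dev-${suffix}"
```

The patch component is dropped, so any `12.1.x` image would resolve to the same `12-1` packages.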
35 changes: 35 additions & 0 deletions docker/base/Dockerfile.common
@@ -0,0 +1,35 @@
ARG UBUNTU_VERSION

FROM nvidia/cuda:12.1.1-base-ubuntu${UBUNTU_VERSION}.04

ARG _UV_HOME="/opt/uv"

ENV UV_INSTALL_DIR="${_UV_HOME}/bin"
ENV UV_MANAGED_PYTHON=1
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

ENV PATH="${UV_INSTALL_DIR}:${PATH}"

ENV OMPI_MCA_pml=^cm,ucx
ENV OMPI_MCA_btl=tcp,self
ENV OMPI_MCA_btl_tcp_if_exclude=lo,docker0
ENV NCCL_SOCKET_IFNAME=^docker,lo

RUN export DEBIAN_FRONTEND=noninteractive \
&& apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}04/x86_64/3bf863cc.pub \
&& apt-get update --fix-missing \
&& apt-get upgrade -y \
&& ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime \
&& apt-get install -y tzdata \
&& dpkg-reconfigure --frontend noninteractive tzdata \
&& apt-get install -y bzip2 ca-certificates curl build-essential git libglib2.0-0 libsm6 libxext6 libxrender1 mercurial openssh-server subversion wget \
libibverbs1 ibverbs-providers ibverbs-utils libibverbs-dev infiniband-diags \
&& rm -rf /var/lib/apt/lists/* \
&& sed -i "s/.*PasswordAuthentication.*/PasswordAuthentication no/g" /etc/ssh/sshd_config \
&& mkdir /run/sshd \
&& mkdir ~/.ssh && chmod 700 ~/.ssh && touch ~/.ssh/authorized_keys \
&& chmod 600 ~/.ssh/authorized_keys \
&& rm /etc/ssh/ssh_host_*

RUN curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_NO_MODIFY_PATH=1 sh \
&& uv python install --preview --default
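`NCCL_SOCKET_IFNAME=^docker,lo` above uses a leading `^`, which, as NCCL's documentation of this variable is understood here, excludes interfaces whose names start with any of the listed prefixes. The sketch below only emulates that filtering on a made-up list of interface names; `filter_ifaces` is a hypothetical helper and is not part of NCCL:

```shell
# Sketch only: emulate a "^prefix,prefix" exclusion list against candidate names.
filter_ifaces() {
  spec="${1#"^"}"   # strip the leading ^ (exclusion marker)
  shift
  for iface in "$@"; do
    keep=1
    old_ifs=$IFS
    IFS=,           # split the prefix list on commas
    for prefix in $spec; do
      case "$iface" in
        "$prefix"*) keep=0 ;;  # name starts with an excluded prefix
      esac
    done
    IFS=$old_ifs
    if [ "$keep" -eq 1 ]; then
      echo "$iface"
    fi
  done
}

# Hypothetical interface names on a cloud VM running Docker.
filter_ifaces '^docker,lo' eth0 lo docker0 ens5
```

With these example names only `eth0` and `ens5` remain, which is the intent of the setting: keep NCCL off the loopback and Docker bridge interfaces.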
55 changes: 26 additions & 29 deletions docker/efa/Dockerfile → docker/base/efa/Dockerfile
@@ -1,15 +1,15 @@
ARG BASE_IMAGE=dstackai/base:py3.12-0.7-cuda-12.1
# syntax = edrevo/dockerfile-plus
Collaborator:

Can we live without it? An unfamiliar dependency that is no longer maintained.

Contributor Author:

Without this dependency, we would need to duplicate the code


FROM ${BASE_IMAGE}
INCLUDE+ base/Dockerfile.common

ENV PREFIX=/usr/local
ENV CUDA_PATH=/usr/local/cuda
ENV NCCL_HOME=/usr/local
ENV CUDA_HOME=/usr/local/cuda
ENV LIBFABRIC_PATH=/opt/amazon/efa
ENV OPEN_MPI_PATH=/opt/amazon/openmpi
ENV PATH="${LIBFABRIC_PATH}/bin:${OPEN_MPI_PATH}/bin:${PATH}"
ENV LD_LIBRARY_PATH="${OPEN_MPI_PATH}/lib:${LD_LIBRARY_PATH}"

# prerequisites
# Prerequisites

RUN cuda_version=$(echo ${CUDA_VERSION} | awk -F . '{ print $1"-"$2 }') \
&& apt-get update \
@@ -19,61 +19,58 @@ RUN cuda_version=$(echo ${CUDA_VERSION} | awk -F . '{ print $1"-"$2 }') \
libhwloc-dev \
autoconf \
automake \
libtool
libtool \
&& rm -rf /var/lib/apt/lists/*

# EFA

ARG EFA_VERSION=1.38.1

RUN cd $HOME \
RUN cd /tmp \
&& apt-get update \
&& curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g
&& ./efa_installer.sh -y --skip-kmod -g \
&& rm -rf /tmp/aws-efa-installer /var/lib/apt/lists/*

# NCCL

ARG NCCL_VERSION=2.26.2-1

RUN cd $HOME \
RUN cd /tmp \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION} \
&& cd nccl \
&& make -j$(nproc) src.build BUILDDIR=${PREFIX}
&& make -j$(nproc) src.build BUILDDIR=${NCCL_HOME} \
&& rm -rf /tmp/nccl

# AWS OFI NCCL

ARG OFI_VERSION=1.14.0

RUN cd $HOME \
RUN cd /tmp \
&& git clone https://github.com/aws/aws-ofi-nccl.git -b v${OFI_VERSION} \
&& cd aws-ofi-nccl \
&& ./autogen.sh \
&& ./configure \
--with-cuda=${CUDA_PATH} \
--with-cuda=${CUDA_HOME} \
--with-libfabric=${LIBFABRIC_PATH} \
--with-mpi=${OPEN_MPI_PATH} \
--with-cuda=${CUDA_PATH} \
--with-nccl=${PREFIX} \
--with-cuda=${CUDA_HOME} \
--with-nccl=${NCCL_HOME} \
--disable-tests \
--prefix=${PREFIX} \
&& make -j$(numproc) \
&& make install
--prefix=${NCCL_HOME} \
&& make -j$(nproc) \
&& make install \
&& rm -rf /tmp/aws-ofi-nccl /var/lib/apt/lists/*

# NCCL Tests

RUN cd $HOME \
RUN cd /opt \
&& git clone https://github.com/NVIDIA/nccl-tests \
&& cd nccl-tests \
&& make -j$(numproc) \
&& make -j$(nproc) \
MPI=1 \
MPI_HOME=${OPEN_MPI_PATH} \
CUDA_HOME=${CUDA_PATH} \
NCCL_HOME=${PREFIX}

ARG BUILD_DATE
ARG IMAGE_NAME
ARG DSTACK_REVISION

LABEL org.opencontainers.image.title="${IMAGE_NAME}"
LABEL org.opencontainers.image.version="${EFA_VERSION}-${DSTACK_REVISION}"
LABEL org.opencontainers.image.created="${BUILD_DATE}"
CUDA_HOME=${CUDA_HOME} \
NCCL_HOME=${NCCL_HOME}
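Among the changes above, the `make -j$(numproc)` calls become `make -j$(nproc)`. One reading, not stated in the PR, is that `numproc` is not a standard command, so the substitution expands to nothing and GNU make receives a bare `-j`, which places no limit on parallel jobs. The sketch below only shows the expansion; `jobs_flag` is a hypothetical helper, and it assumes no `numproc` binary exists on `PATH`:

```shell
# Sketch only: build the -j flag from a CPU-count command.
# A missing command produces an empty substitution, hence a bare "-j".
jobs_flag() {
  echo "-j$("$1" 2>/dev/null)"
}

jobs_flag numproc   # bare -j when numproc is not installed
jobs_flag nproc     # -j followed by the CPU count where coreutils provides nproc
```

A failed command substitution inside an argument does not fail the surrounding command, which would explain why such a typo can go unnoticed in a `&&` chain.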
File renamed without changes.
8 changes: 2 additions & 6 deletions examples/clusters/nccl-tests/.dstack.yml
@@ -5,26 +5,22 @@ nodes: 2
startup_order: workers-first
stop_criteria: master-done

# This image comes with MPI and NCCL tests pre-built
image: dstackai/efa
env:
- NCCL_DEBUG=INFO
commands:
- cd /root/nccl-tests/build
- |
if [ $DSTACK_NODE_RANK -eq 0 ]; then
mpirun \
--allow-run-as-root \
--hostfile $DSTACK_MPI_HOSTFILE \
-n $DSTACK_GPUS_NUM \
-N $DSTACK_GPUS_PER_NODE \
--mca btl_tcp_if_exclude lo,docker0 \
--bind-to none \
./all_reduce_perf -b 8 -e 8G -f 2 -g 1
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
else
sleep infinity
fi

resources:
gpu: nvidia:4:16GB
gpu: nvidia:1..8
shm_size: 16GB
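The `--mca btl_tcp_if_exclude lo,docker0` flag listed above has a counterpart in `Dockerfile.common`, which sets `OMPI_MCA_btl_tcp_if_exclude`. Open MPI reads an `OMPI_MCA_<name>` environment variable as the MCA parameter `<name>`, so the variable can stand in for the flag. A sketch that prints the command-line form of whatever `OMPI_MCA_*` variables are set; `mca_flags_from_env` is a hypothetical helper and the exported value simply repeats the one from the Dockerfile:

```shell
# Sketch only: list OMPI_MCA_* variables as the --mca flags they correspond to.
mca_flags_from_env() {
  env | grep '^OMPI_MCA_' | sort | while IFS='=' read -r name value; do
    # Strip the OMPI_MCA_ prefix to recover the MCA parameter name.
    printf '%s %s %s\n' "--mca" "${name#OMPI_MCA_}" "$value"
  done
}

export OMPI_MCA_btl_tcp_if_exclude=lo,docker0
mca_flags_from_env
```

Keeping these settings in the image means every `mpirun` invocation inherits them, so example configs no longer need to repeat the flags.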
6 changes: 2 additions & 4 deletions examples/distributed-training/torchrun/.dstack.yml
@@ -1,7 +1,6 @@
type: task
name: train-distrib

# The size of the cluster
nodes: 2

python: 3.12
@@ -21,6 +20,5 @@ commands:
multinode.py 50 10

resources:
gpu: 24GB:1..2
# Uncomment if using multiple GPUs
#shm_size: 24GB
gpu: 1..8
shm_size: 16GB
8 changes: 0 additions & 8 deletions scripts/packer/aws-image-cuda.json
@@ -81,14 +81,6 @@
{
"type": "shell",
"script": "provisioners/install-nvidia-container-toolkit.sh"
},
{
"type": "shell",
"environment_vars": [
"IMAGE_REPO={{user `image_repo`}}",
"IMAGE_VERSION={{user `image_version`}}"
],
"script": "provisioners/pull-docker-images.sh"
}
]
}