31 changes: 15 additions & 16 deletions docs/blog/posts/amd-on-tensorwave.md

@@ -1,20 +1,20 @@
---
title: Using SSH fleets with TensorWave's private AMD cloud
date: 2025-03-11
description: "This tutorial walks you through how dstack can be used with TensorWave's private AMD cloud using SSH fleets."
slug: amd-on-tensorwave
image: https://dstack.ai/static-assets/static-assets/images/dstack-tensorwave-v2.png
categories:
- Case studies
---

# Using SSH fleets with TensorWave's private AMD cloud

Since last month, when we introduced support for private clouds and data centers, it has become easier to use `dstack`
to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters.

In this tutorial, we’ll walk you through how `dstack` can be used with
[TensorWave :material-arrow-top-right-thin:{ .external }](https://tensorwave.com/){:target="_blank"} using
[SSH fleets](../../docs/concepts/fleets.md#ssh).

<img src="https://dstack.ai/static-assets/static-assets/images/dstack-tensorwave-v2.png" width="630"/>
@@ -32,13 +32,12 @@ TensorWave dashboard.
## Creating a fleet

??? info "Prerequisites"
Once `dstack` is [installed](https://dstack.ai/docs/installation), create a project folder.

<div class="termy">

```shell
$ mkdir tensorwave-demo && cd tensorwave-demo
```

</div>
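
The fleet itself is described in a YAML configuration. Below is a minimal sketch of what an SSH fleet configuration can look like; the fleet name, user, key path, and host addresses are placeholders, not values from this tutorial:

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
# The fleet name (placeholder)
name: my-tensorwave-fleet

# Uncomment if the hosts are interconnected
#placement: cluster

# Connection details for the on-prem hosts (placeholders)
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.0.2.10
    - 192.0.2.11
```

</div>

See [SSH fleets](../../docs/concepts/fleets.md#ssh) for the full list of supported properties.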
@@ -79,9 +78,9 @@ $ dstack apply -f fleet.dstack.yml
Provisioning...
---> 100%

 FLEET                INSTANCE  RESOURCES         STATUS    CREATED
 my-tensorwave-fleet  0         8xMI300X (192GB)  0/8 busy  3 mins ago
                      1         8xMI300X (192GB)  0/8 busy  3 mins ago

```

@@ -98,7 +97,7 @@ Once the fleet is created, you can use `dstack` to run workloads.

A dev environment lets you access an instance through your desktop IDE.

<div editor-title=".dstack.yml">

```yaml
type: dev-environment
@@ -137,9 +136,9 @@ Open the link to access the dev environment using your desktop IDE.

A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding.

Below is a distributed training task configuration:

<div editor-title="train.dstack.yml">

```yaml
type: task
@@ -175,7 +174,7 @@ Provisioning `train-distrib`...

</div>

`dstack` automatically runs the container on each node while passing
[system environment variables](../../docs/concepts/tasks.md#system-environment-variables)
which you can use with `torchrun`, `accelerate`, or other distributed frameworks.
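
As a sketch, a `torchrun` invocation inside a task's `commands` could consume these variables as follows; the variable names are those documented for `dstack` system environment variables, and `train.py` is a placeholder script:

```yaml
commands:
  # Launch one process per GPU on every node, wiring torchrun
  # to the system environment variables dstack passes to each node
  - torchrun
    --nproc-per-node $DSTACK_GPUS_PER_NODE
    --nnodes $DSTACK_NODES_NUM
    --node-rank $DSTACK_NODE_RANK
    --master-addr $DSTACK_MASTER_NODE_IP
    --master-port 29500
    train.py
```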

@@ -185,7 +184,7 @@ A service allows you to deploy a model or any web app as a scalable and secure endpoint.

Create the following configuration file inside the repo:

<div editor-title="deepseek.dstack.yml">

```yaml
type: service
@@ -196,7 +195,7 @@ env:
- MODEL_ID=deepseek-ai/DeepSeek-R1
- HSA_NO_SCRATCH_RECLAIM=1
commands:
  - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code
port: 8000
model: deepseek-ai/DeepSeek-R1

@@ -221,7 +220,7 @@ Submit the run `deepseek-r1-sglang`? [y/n]: y
Provisioning `deepseek-r1-sglang`...
---> 100%

Service is published at:
http://localhost:3000/proxy/services/main/deepseek-r1-sglang/
Model deepseek-ai/DeepSeek-R1 is published at:
http://localhost:3000/proxy/models/main/
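
The published model endpoint is OpenAI-compatible, so it can be queried with a standard chat-completions request. Below is a sketch; it assumes the chat-completions path is appended to the published model URL, and `<dstack token>` is a placeholder for your dstack user token:

<div class="termy">

```shell
$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
          "model": "deepseek-ai/DeepSeek-R1",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'
```

</div>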
@@ -236,6 +235,6 @@ Want to see how it works? Check out the video below:
<iframe width="750" height="520" src="https://www.youtube.com/embed/b1vAgm5fCfE?si=qw2gYHkMjERohdad&rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

!!! info "What's next?"
    1. See [SSH fleets](../../docs/concepts/fleets.md#ssh)
2. Read about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md)
3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd)
69 changes: 36 additions & 33 deletions examples/accelerators/amd/README.md

@@ -1,22 +1,22 @@
# AMD

`dstack` supports running dev environments, tasks, and services on AMD GPUs.
You can do that by setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh)
with on-prem AMD GPUs or configuring a backend that offers AMD GPUs such as the `runpod` backend.
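
If you don't have on-prem GPUs, a backend is configured in the `dstack` server's configuration file instead. A minimal sketch for the `runpod` backend is shown below; the API key is a placeholder, and the exact schema should be checked against the dstack server configuration docs:

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
  - name: main
    backends:
      # RunPod offers on-demand AMD GPUs (API key is a placeholder)
      - type: runpod
        creds:
          api_key: <your RunPod API key>
```

</div>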

## Deployment

Most serving frameworks, including vLLM and TGI, support AMD. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/installation_amd){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html){:target="_blank"}.

=== "TGI"

<div editor-title="examples/inference/tgi/amd/.dstack.yml">

```yaml
type: service
name: amd-service-tgi

# Using the official TGI's ROCm Docker image
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm

@@ -30,26 +30,26 @@
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-70B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
gpu: MI300X
disk: 150GB
```

</div>


=== "vLLM"

<div editor-title="examples/inference/vllm/amd/.dstack.yml">

```yaml
type: service
name: llama31-service-vllm-amd

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04
# Required environment variables
@@ -84,20 +84,20 @@
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-70B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
gpu: MI300X
disk: 200GB
```
</div>

Note that the maximum size of vLLM's KV cache is 126192, so we must set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to `PATH` ensures we use the Python 3.10 environment required by the pre-built binaries, which were compiled specifically for this version.

> To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3.
> You can find the task to build and upload the binary in
> [`examples/inference/vllm/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd/){:target="_blank"}.

!!! info "Docker image"
@@ -110,22 +110,25 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by a colon.
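
For example, the following resources block requests eight MI300X GPUs on a single instance:

```yaml
resources:
  # The quantity follows the GPU name after a colon
  gpu: MI300X:8
```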

=== "TRL"

    Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html){:target="_blank"}
and the [`mlabonne/guanaco-llama2-1k` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k){:target="_blank"}
dataset.

<div editor-title="examples/single-node-training/trl/amd/.dstack.yml">

```yaml
type: task
name: trl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04

# Required environment variables
env:
- HF_TOKEN
# Mount files
files:
- train.py
# Commands of the task
commands:
- export PATH=/opt/conda/envs/py_3.10/bin:$PATH
@@ -140,25 +143,25 @@
- pip install peft
- pip install transformers datasets huggingface-hub scipy
- cd ..
- python train.py

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
gpu: MI300X
disk: 150GB
```

</div>

=== "Axolotl"
Below is an example of fine-tuning Llama 3.1 8B using [Axolotl :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html){:target="_blank"}
and the [tatsu-lab/alpaca :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/tatsu-lab/alpaca){:target="_blank"}
dataset.

<div editor-title="examples/single-node-training/axolotl/amd/.dstack.yml">

```yaml
type: task
# The name is optional, if not specified, generated randomly
@@ -198,9 +201,9 @@
- make
- pip install .
- cd ..
    - accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
    --wandb-project "$WANDB_PROJECT"
    --wandb-name "$WANDB_NAME"
--hub-model-id "$HUB_MODEL_ID"

resources:
@@ -211,7 +214,7 @@

Note that to support ROCm, we need to check out commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround :material-arrow-top-right-thin:{ .external }](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround){:target="_blank"}. The same installation approach is followed when building the Axolotl ROCm Docker image ([see Dockerfile :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}).

> To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
> You can find the tasks that build and upload the binaries
> in [`examples/single-node-training/axolotl/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd/){:target="_blank"}.

@@ -235,7 +238,7 @@ $ dstack apply -f examples/inference/vllm/amd/.dstack.yml

## Source code

The source code of this example can be found in
[`examples/inference/tgi/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/amd){:target="_blank"},
[`examples/inference/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd){:target="_blank"},
[`examples/single-node-training/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd){:target="_blank"} and