31 changes: 15 additions & 16 deletions docs/blog/posts/amd-on-tensorwave.md

@@ -1,20 +1,20 @@
---
title: Using SSH fleets with TensorWave's private AMD cloud
date: 2025-03-11
description: "This tutorial walks you through how dstack can be used with TensorWave's private AMD cloud using SSH fleets."
slug: amd-on-tensorwave
image: https://dstack.ai/static-assets/static-assets/images/dstack-tensorwave-v2.png
categories:
- Case studies
---

# Using SSH fleets with TensorWave's private AMD cloud

Since last month, when we introduced support for private clouds and data centers, it has become easier to use `dstack`
to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters.

In this tutorial, we’ll walk you through how `dstack` can be used with
[TensorWave :material-arrow-top-right-thin:{ .external }](https://tensorwave.com/){:target="_blank"} using
[SSH fleets](../../docs/concepts/fleets.md#ssh).

<img src="https://dstack.ai/static-assets/static-assets/images/dstack-tensorwave-v2.png" width="630"/>
@@ -32,13 +32,12 @@ TensorWave dashboard.
## Creating a fleet

??? info "Prerequisites"
Once `dstack` is [installed](https://dstack.ai/docs/installation), create a project folder.

<div class="termy">

```shell
$ mkdir tensorwave-demo && cd tensorwave-demo
```

</div>
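
The fleet itself is described in a YAML configuration. Below is a minimal sketch of what an SSH fleet configuration can look like; the fleet name, user, key path, and host addresses are placeholders, not values from this tutorial:

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
# The fleet name (placeholder)
name: my-tensorwave-fleet

# Uncomment if the hosts are interconnected
#placement: cluster

# Connection details for the on-prem hosts (placeholders)
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.0.2.10
    - 192.0.2.11
```

</div>

See [SSH fleets](../../docs/concepts/fleets.md#ssh) for the full list of supported properties.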
@@ -79,9 +78,9 @@ $ dstack apply -f fleet.dstack.yml
Provisioning...
---> 100%

 FLEET                INSTANCE  RESOURCES         STATUS    CREATED
 my-tensorwave-fleet  0         8xMI300X (192GB)  0/8 busy  3 mins ago
                      1         8xMI300X (192GB)  0/8 busy  3 mins ago

```

@@ -98,7 +97,7 @@ Once the fleet is created, you can use `dstack` to run workloads.

A dev environment lets you access an instance through your desktop IDE.

<div editor-title=".dstack.yml">

```yaml
type: dev-environment
@@ -137,9 +136,9 @@ Open the link to access the dev environment using your desktop IDE.

A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding.

Below is a distributed training task configuration:

<div editor-title="train.dstack.yml">

```yaml
type: task
@@ -175,7 +174,7 @@ Provisioning `train-distrib`...

</div>

`dstack` automatically runs the container on each node while passing
[system environment variables](../../docs/concepts/tasks.md#system-environment-variables)
which you can use with `torchrun`, `accelerate`, or other distributed frameworks.
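
As a sketch, a `torchrun` invocation inside a task's `commands` could consume these variables as follows; the variable names are those documented for `dstack` system environment variables, and `train.py` is a placeholder script:

```yaml
commands:
  # Launch one process per GPU on every node, wiring torchrun
  # to the system environment variables dstack passes to each node
  - torchrun
    --nproc-per-node $DSTACK_GPUS_PER_NODE
    --nnodes $DSTACK_NODES_NUM
    --node-rank $DSTACK_NODE_RANK
    --master-addr $DSTACK_MASTER_NODE_IP
    --master-port 29500
    train.py
```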

@@ -185,7 +184,7 @@ A service allows you to deploy a model or any web app as a scalable and secure endpoint.

Create the following configuration file inside the repo:

<div editor-title="deepseek.dstack.yml">

```yaml
type: service
@@ -196,7 +195,7 @@ env:
- MODEL_ID=deepseek-ai/DeepSeek-R1
- HSA_NO_SCRATCH_RECLAIM=1
commands:
  - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code
port: 8000
model: deepseek-ai/DeepSeek-R1

@@ -221,7 +220,7 @@ Submit the run `deepseek-r1-sglang`? [y/n]: y
Provisioning `deepseek-r1-sglang`...
---> 100%

Service is published at:
http://localhost:3000/proxy/services/main/deepseek-r1-sglang/
Model deepseek-ai/DeepSeek-R1 is published at:
http://localhost:3000/proxy/models/main/
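
The published model endpoint is OpenAI-compatible, so it can be queried with a standard chat-completions request. Below is a sketch; it assumes the chat-completions path is appended to the published model URL, and `<dstack token>` is a placeholder for your dstack user token:

<div class="termy">

```shell
$ curl http://localhost:3000/proxy/models/main/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
          "model": "deepseek-ai/DeepSeek-R1",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'
```

</div>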
@@ -236,6 +235,6 @@ Want to see how it works? Check out the video below:
<iframe width="750" height="520" src="https://www.youtube.com/embed/b1vAgm5fCfE?si=qw2gYHkMjERohdad&rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

!!! info "What's next?"
    1. See [SSH fleets](../../docs/concepts/fleets.md#ssh)
2. Read about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md)
3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd)
69 changes: 36 additions & 33 deletions examples/accelerators/amd/README.md

@@ -1,22 +1,22 @@
# AMD

`dstack` supports running dev environments, tasks, and services on AMD GPUs.
You can do that by setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh)
with on-prem AMD GPUs or configuring a backend that offers AMD GPUs such as the `runpod` backend.
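
If you don't have on-prem GPUs, a backend is configured in the `dstack` server's configuration file instead. A minimal sketch for the `runpod` backend is shown below; the API key is a placeholder, and the exact schema should be checked against the dstack server configuration docs:

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
  - name: main
    backends:
      # RunPod offers on-demand AMD GPUs (API key is a placeholder)
      - type: runpod
        creds:
          api_key: <your RunPod API key>
```

</div>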

## Deployment

Most serving frameworks, including vLLM and TGI, support AMD. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/installation_amd){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html){:target="_blank"}.

=== "TGI"

<div editor-title="examples/inference/tgi/amd/.dstack.yml">

```yaml
type: service
name: amd-service-tgi

# Using the official TGI's ROCm Docker image
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm

@@ -30,26 +30,26 @@
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-70B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
gpu: MI300X
disk: 150GB
```

</div>


=== "vLLM"

<div editor-title="examples/inference/vllm/amd/.dstack.yml">

```yaml
type: service
name: llama31-service-vllm-amd

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04
# Required environment variables
@@ -84,20 +84,20 @@
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-70B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
gpu: MI300X
disk: 200GB
```
</div>

Note that the maximum size of vLLM's KV cache is 126192, so we must set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to `PATH` ensures we use the Python 3.10 environment required by the pre-built binaries, which were compiled specifically for this version.

> To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3.
> You can find the task to build and upload the binary in
> [`examples/inference/vllm/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd/){:target="_blank"}.

!!! info "Docker image"
@@ -110,22 +110,25 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by a colon.
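
For example, the following resources block requests eight MI300X GPUs on a single instance:

```yaml
resources:
  # The quantity follows the GPU name after a colon
  gpu: MI300X:8
```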

=== "TRL"

    Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html){:target="_blank"}
and the [`mlabonne/guanaco-llama2-1k` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k){:target="_blank"}
dataset.

<div editor-title="examples/single-node-training/trl/amd/.dstack.yml">

```yaml
type: task
name: trl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04

# Required environment variables
env:
- HF_TOKEN
# Mount files
files:
- train.py
# Commands of the task
commands:
- export PATH=/opt/conda/envs/py_3.10/bin:$PATH
@@ -140,25 +143,25 @@
- pip install peft
- pip install transformers datasets huggingface-hub scipy
- cd ..
- python train.py

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
gpu: MI300X
disk: 150GB
```

</div>

=== "Axolotl"
Below is an example of fine-tuning Llama 3.1 8B using [Axolotl :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html){:target="_blank"}
and the [tatsu-lab/alpaca :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/tatsu-lab/alpaca){:target="_blank"}
dataset.

<div editor-title="examples/single-node-training/axolotl/amd/.dstack.yml">

```yaml
type: task
# The name is optional, if not specified, generated randomly
@@ -198,9 +201,9 @@
- make
- pip install .
- cd ..
    - accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
    --wandb-project "$WANDB_PROJECT"
    --wandb-name "$WANDB_NAME"
--hub-model-id "$HUB_MODEL_ID"

resources:
@@ -211,7 +214,7 @@

Note that to support ROCm, we need to check out commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround :material-arrow-top-right-thin:{ .external }](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround){:target="_blank"}. The same installation approach is followed when building the Axolotl ROCm Docker image ([see Dockerfile :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}).

> To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
> You can find the tasks that build and upload the binaries
> in [`examples/single-node-training/axolotl/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd/){:target="_blank"}.

@@ -235,7 +238,7 @@ $ dstack apply -f examples/inference/vllm/amd/.dstack.yml

## Source code

The source code of this example can be found in
[`examples/inference/tgi/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/amd){:target="_blank"},
[`examples/inference/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd){:target="_blank"},
[`examples/single-node-training/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd){:target="_blank"} and