diff --git a/docs/blog/posts/amd-on-tensorwave.md b/docs/blog/posts/amd-on-tensorwave.md index 52d153bed8..80a766b94d 100644 --- a/docs/blog/posts/amd-on-tensorwave.md +++ b/docs/blog/posts/amd-on-tensorwave.md @@ -1,20 +1,20 @@ --- title: Using SSH fleets with TensorWave's private AMD cloud date: 2025-03-11 -description: "This tutorial walks you through how dstack can be used with TensorWave's private AMD cloud using SSH fleets." +description: "This tutorial walks you through how dstack can be used with TensorWave's private AMD cloud using SSH fleets." slug: amd-on-tensorwave image: https://dstack.ai/static-assets/static-assets/images/dstack-tensorwave-v2.png categories: - Case studies --- -# Using SSH fleets with TensorWave's private AMD cloud +# Using SSH fleets with TensorWave's private AMD cloud Since last month, when we introduced support for private clouds and data centers, it has become easier to use `dstack` to orchestrate AI containers with any AI cloud vendor, whether they provide on-demand compute or reserved clusters. In this tutorial, we’ll walk you through how `dstack` can be used with -[TensorWave :material-arrow-top-right-thin:{ .external }](https://tensorwave.com/){:target="_blank"} using +[TensorWave :material-arrow-top-right-thin:{ .external }](https://tensorwave.com/){:target="_blank"} using [SSH fleets](../../docs/concepts/fleets.md#ssh). @@ -32,13 +32,12 @@ TensorWave dashboard. ## Creating a fleet ??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), create a project repo folder and run `dstack init`. + Once `dstack` is [installed](https://dstack.ai/docs/installation), create a project folder.
```shell $ mkdir tensorwave-demo && cd tensorwave-demo - $ dstack init ```
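For context, the `fleet.dstack.yml` applied below follows `dstack`'s SSH fleet schema. Below is a minimal sketch, not the exact file from this tutorial: the host IPs, the `ubuntu` user, and the key path are placeholders to replace with the details of your TensorWave instances.

```yaml
type: fleet
# The fleet name referenced in the `dstack apply` output
name: my-tensorwave-fleet

# SSH connection details for the reserved hosts (placeholders)
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.1
    - 192.168.100.2
```

Applying this file makes `dstack` connect to each host over SSH and register it as a fleet instance.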
@@ -79,9 +78,9 @@ $ dstack apply -f fleet.dstack.yml Provisioning... ---> 100% - FLEET INSTANCE RESOURCES STATUS CREATED - my-tensorwave-fleet 0 8xMI300X (192GB) 0/8 busy 3 mins ago - 1 8xMI300X (192GB) 0/8 busy 3 mins ago + FLEET INSTANCE RESOURCES STATUS CREATED + my-tensorwave-fleet 0 8xMI300X (192GB) 0/8 busy 3 mins ago + 1 8xMI300X (192GB) 0/8 busy 3 mins ago ``` @@ -98,7 +97,7 @@ Once the fleet is created, you can use `dstack` to run workloads. A dev environment lets you access an instance through your desktop IDE. -
+
```yaml type: dev-environment @@ -137,9 +136,9 @@ Open the link to access the dev environment using your desktop IDE. A task allows you to schedule a job or run a web app. Tasks can be distributed and support port forwarding. -Below is a distributed training task configuration: +Below is a distributed training task configuration: -
+
```yaml type: task @@ -175,7 +174,7 @@ Provisioning `train-distrib`...
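For reference, a distributed task typically forwards `dstack`'s system environment variables to the launcher in its `commands`. The snippet below is an illustrative sketch, not the configuration used above; the script name, port, and resources are placeholders.

```yaml
type: task
name: train-distrib-sketch
# Run the task across both nodes of the fleet
nodes: 2

commands:
  # DSTACK_* variables are injected by dstack on every node
  - torchrun
    --nproc_per_node=$DSTACK_GPUS_PER_NODE
    --nnodes=$DSTACK_NODES_NUM
    --node_rank=$DSTACK_NODE_RANK
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=29500
    train.py

resources:
  gpu: MI300X:8
```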
-`dstack` automatically runs the container on each node while passing +`dstack` automatically runs the container on each node while passing [system environment variables](../../docs/concepts/tasks.md#system-environment-variables) which you can use with `torchrun`, `accelerate`, or other distributed frameworks. @@ -185,7 +184,7 @@ A service allows you to deploy a model or any web app as a scalable and secure e Create the following configuration file inside the repo: -
+
```yaml type: service @@ -196,7 +195,7 @@ env: - MODEL_ID=deepseek-ai/DeepSeek-R1 - HSA_NO_SCRATCH_RECLAIM=1 commands: - - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code + - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --tp 8 --trust-remote-code port: 8000 model: deepseek-ai/DeepSeek-R1 @@ -221,7 +220,7 @@ Submit the run `deepseek-r1-sglang`? [y/n]: y Provisioning `deepseek-r1-sglang`... ---> 100% -Service is published at: +Service is published at: http://localhost:3000/proxy/services/main/deepseek-r1-sglang/ Model deepseek-ai/DeepSeek-R1 is published at: http://localhost:3000/proxy/models/main/ @@ -236,6 +235,6 @@ Want to see how it works? Check out the video below: !!! info "What's next?" - 1. See [SSH fleets](../../docs/concepts/fleets.md#ssh) + 1. See [SSH fleets](../../docs/concepts/fleets.md#ssh) 2. Read about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), and [services](../../docs/concepts/services.md) 3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd) diff --git a/examples/accelerators/amd/README.md b/examples/accelerators/amd/README.md index 6036594304..d75841d150 100644 --- a/examples/accelerators/amd/README.md +++ b/examples/accelerators/amd/README.md @@ -1,22 +1,22 @@ # AMD `dstack` supports running dev environments, tasks, and services on AMD GPUs. -You can do that by setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh) +You can do that by setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh) with on-prem AMD GPUs or configuring a backend that offers AMD GPUs such as the `runpod` backend. ## Deployment -Most serving frameworks including vLLM and TGI have AMD support. Here's an example of a [service](https://dstack.ai/docs/services) that deploys +Most serving frameworks including vLLM and TGI have AMD support. 
Here's an example of a [service](https://dstack.ai/docs/services) that deploys Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/installation_amd){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/amd-installation.html){:target="_blank"}. === "TGI" - -
- + +
+ ```yaml type: service name: amd-service-tgi - + # Using the official TGI's ROCm Docker image image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm @@ -30,26 +30,26 @@ Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](h port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-70B-Instruct - + # Uncomment to leverage spot instances #spot_policy: auto - + resources: gpu: MI300X disk: 150GB ``` - +
=== "vLLM" -
- +
+ ```yaml type: service name: llama31-service-vllm-amd - + # Using RunPod's ROCm Docker image image: runpod/pytorch:2.4.0-py3.10-rocm6.1.0-ubuntu22.04 # Required environment variables @@ -84,10 +84,10 @@ Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](h port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-70B-Instruct - + # Uncomment to leverage spot instances #spot_policy: auto - + resources: gpu: MI300X disk: 200GB @@ -95,9 +95,9 @@ Llama 3.1 70B in FP16 using [TGI :material-arrow-top-right-thin:{ .external }](h
    Note that the maximum size of vLLM’s `KV cache` is 126192; consequently, we must set `MAX_MODEL_LEN` to 126192. Adding `/opt/conda/envs/py_3.10/bin` to PATH ensures we use the Python 3.10 environment necessary for the pre-built binaries compiled specifically for this version.
-    
-    > To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3. 
-    > You can find the task to build and upload the binary in 
+
+    > To speed up the `vLLM-ROCm` installation, we use a pre-built binary from S3.
+    > You can find the task to build and upload the binary in
    > [`examples/inference/vllm/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd/){:target="_blank"}.

 !!! info "Docker image"
@@ -110,22 +110,25 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by

 === "TRL"

-    Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html){:target="_blank"} 
+    Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL :material-arrow-top-right-thin:{ .external }](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html){:target="_blank"}
    and the [`mlabonne/guanaco-llama2-1k` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k){:target="_blank"} dataset.
-    
+
- + ```yaml type: task name: trl-amd-llama31-train - + # Using RunPod's ROCm Docker image image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04 # Required environment variables env: - HF_TOKEN + # Mount files + files: + - train.py # Commands of the task commands: - export PATH=/opt/conda/envs/py_3.10/bin:$PATH @@ -140,25 +143,25 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by - pip install peft - pip install transformers datasets huggingface-hub scipy - cd .. - - python examples/single-node-training/trl/amd/train.py - + - python train.py + # Uncomment to leverage spot instances #spot_policy: auto - + resources: gpu: MI300X disk: 150GB ``` - +
=== "Axolotl" - Below is an example of fine-tuning Llama 3.1 8B using [Axolotl :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html){:target="_blank"} + Below is an example of fine-tuning Llama 3.1 8B using [Axolotl :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html){:target="_blank"} and the [tatsu-lab/alpaca :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/tatsu-lab/alpaca){:target="_blank"} dataset. - +
- + ```yaml type: task # The name is optional, if not specified, generated randomly @@ -198,9 +201,9 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by - make - pip install . - cd .. - - accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml - --wandb-project "$WANDB_PROJECT" - --wandb-name "$WANDB_NAME" + - accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml + --wandb-project "$WANDB_PROJECT" + --wandb-name "$WANDB_NAME" --hub-model-id "$HUB_MODEL_ID" resources: @@ -211,7 +214,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by Note, to support ROCm, we need to checkout to commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround :material-arrow-top-right-thin:{ .external }](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround). This installation approach is also followed for building Axolotl ROCm docker image. [(See Dockerfile) :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}. - > To speed up installation of `flash-attention` and `xformers `, we use pre-built binaries uploaded to S3. + > To speed up installation of `flash-attention` and `xformers `, we use pre-built binaries uploaded to S3. > You can find the tasks that build and upload the binaries > in [`examples/single-node-training/axolotl/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd/){:target="_blank"}. 
@@ -235,7 +238,7 @@ $ dstack apply -f examples/inference/vllm/amd/.dstack.yml ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/inference/tgi/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/amd){:target="_blank"}, [`examples/inference/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd){:target="_blank"}, [`examples/single-node-training/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd){:target="_blank"} and diff --git a/examples/accelerators/tpu/README.md b/examples/accelerators/tpu/README.md index f5fd933138..2aa595099c 100644 --- a/examples/accelerators/tpu/README.md +++ b/examples/accelerators/tpu/README.md @@ -7,8 +7,8 @@ or request TPUs by specifying `tpu` as `vendor` ([see examples](https://dstack.a Below are a few examples on using TPUs for deployment and fine-tuning. !!! info "Multi-host TPUs" - Currently, `dstack` supports only single-host TPUs, which means that - the maximum supported number of cores is `8` (e.g. `v2-8`, `v3-8`, `v5litepod-8`, `v5p-8`, `v6e-8`). + Currently, `dstack` supports only single-host TPUs, which means that + the maximum supported number of cores is `8` (e.g. `v2-8`, `v3-8`, `v5litepod-8`, `v5p-8`, `v6e-8`). Multi-host TPU support is on the roadmap. !!! info "TPU storage" @@ -18,18 +18,18 @@ Below are a few examples on using TPUs for deployment and fine-tuning. ## Deployment Many serving frameworks including vLLM and TGI have TPU support. 
-Here's an example of a [service](https://dstack.ai/docs/services) that deploys Llama 3.1 8B using +Here's an example of a [service](https://dstack.ai/docs/services) that deploys Llama 3.1 8B using [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"}. === "Optimum TPU" -
- +
+ ```yaml type: service name: llama31-service-optimum-tpu - + image: dstackai/optimum-tpu:llama31 env: - HF_TOKEN @@ -41,7 +41,7 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm- port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-8B-Instruct - + resources: gpu: v5litepod-4 ``` @@ -50,14 +50,14 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm- Note that for Optimum TPU `MAX_INPUT_TOKEN` is set to 4095 by default. We must also set `MAX_BATCH_PREFILL_TOKENS` to 4095. ??? info "Docker image" - The official Docker image `huggingface/optimum-tpu:latest` doesn’t support Llama 3.1-8B. - We’ve created a custom image with the fix: `dstackai/optimum-tpu:llama31`. - Once the [pull request :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu/pull/92){:target="_blank"} is merged, + The official Docker image `huggingface/optimum-tpu:latest` doesn’t support Llama 3.1-8B. + We’ve created a custom image with the fix: `dstackai/optimum-tpu:llama31`. + Once the [pull request :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu/pull/92){:target="_blank"} is merged, the official Docker image can be used. === "vLLM" -
- +
+ ```yaml type: service name: llama31-service-vllm-tpu @@ -79,17 +79,17 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm- - pip install -r requirements-tpu.txt - apt-get install -y libopenblas-base libopenmpi-dev libomp-dev - python setup.py develop - - vllm serve $MODEL_ID - --tensor-parallel-size 4 + - vllm serve $MODEL_ID + --tensor-parallel-size 4 --max-model-len $MAX_MODEL_LEN --port 8000 port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-8B-Instruct - + # Uncomment to leverage spot instances #spot_policy: auto - + resources: gpu: v5litepod-4 ``` @@ -123,11 +123,11 @@ cloud resources and run the configuration. ## Fine-tuning with Optimum TPU -Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"} +Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"} and the [`Abirate/english_quotes` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/Abirate/english_quotes){:target="_blank"} dataset. -
+
```yaml type: task @@ -136,11 +136,14 @@ name: optimum-tpu-llama-train python: "3.11" env: - HF_TOKEN +files: + - train.py + - config.yaml commands: - git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git - mkdir -p optimum-tpu/examples/custom/ - - cp examples/single-node-training/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py - - cp examples/single-node-training/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml + - cp train.py optimum-tpu/examples/custom/train.py + - cp config.yaml optimum-tpu/examples/custom/config.yaml - cd optimum-tpu - pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html - pip install datasets evaluate @@ -178,7 +181,7 @@ Note, `v5litepod` is optimized for fine-tuning transformer-based models. Each co ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/inference/tgi/tpu` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/tpu){:target="_blank"}, [`examples/inference/vllm/tpu` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/tpu){:target="_blank"}, and [`examples/single-node-training/optimum-tpu` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl){:target="_blank"}. @@ -188,5 +191,5 @@ and [`examples/single-node-training/optimum-tpu` :material-arrow-top-right-thin: 1. Browse [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu), [Optimum TPU TGI :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) and [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html). -2. 
Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
+2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
 [services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/concepts/fleets).
diff --git a/examples/distributed-training/axolotl/README.md b/examples/distributed-training/axolotl/README.md
index dd4b7cdb04..17efaf1e1a 100644
--- a/examples/distributed-training/axolotl/README.md
+++ b/examples/distributed-training/axolotl/README.md
@@ -3,14 +3,13 @@ This example walks you through how to run distributed fine-tuning using [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/axolotl-ai-cloud/axolotl){:target="_blank"} with `dstack`.

 ??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -67,7 +66,7 @@ commands: --machine_rank=$DSTACK_NODE_RANK \ --num_processes=$DSTACK_GPUS_NUM \ --num_machines=$DSTACK_NODES_NUM - + resources: gpu: 80GB:8 shm_size: 128GB @@ -93,10 +92,10 @@ $ WANDB_PROJECT=... $ HUB_MODEL_ID=... $ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle - 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle - + # BACKEND RESOURCES INSTANCE TYPE PRICE + 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle + 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle + Submit the run trl-train-fsdp-distrib? [y/n]: y Provisioning... @@ -106,10 +105,10 @@ Provisioning... ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/distributed-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl). !!! info "What's next?" 1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide - 2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), + 2. 
Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
 [services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)
diff --git a/examples/distributed-training/trl/README.md b/examples/distributed-training/trl/README.md
index 7ac67047e8..3e3977c89e 100644
--- a/examples/distributed-training/trl/README.md
+++ b/examples/distributed-training/trl/README.md
@@ -3,14 +3,13 @@ This example walks you through how to run distributed fine-tuning using [TRL :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/trl){:target="_blank"}, [Accelerate :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/accelerate){:target="_blank"} and [Deepspeed :material-arrow-top-right-thin:{ .external }](https://github.com/deepspeedai/DeepSpeed){:target="_blank"}.

 ??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
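The next step assumes a fleet of interconnected nodes. For cloud backends this is expressed with `placement: cluster`, which provisions the nodes on the same network segment for fast interconnect. A minimal sketch, in which the fleet name, node count, and GPU size are illustrative:

```yaml
type: fleet
name: my-fleet

# Co-locate nodes for fast inter-node communication
placement: cluster
nodes: 2

resources:
  gpu: 80GB:8
```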
@@ -41,7 +40,7 @@ Once the fleet is created, define a distributed task configuration. Here's an ex - WANDB_API_KEY - MODEL_ID=meta-llama/Llama-3.1-8B - HUB_MODEL_ID - + commands: - pip install transformers bitsandbytes peft wandb - git clone https://github.com/huggingface/trl @@ -98,7 +97,7 @@ Once the fleet is created, define a distributed task configuration. Here's an ex - HUB_MODEL_ID - MODEL_ID=meta-llama/Llama-3.1-8B - ACCELERATE_LOG_LEVEL=info - + commands: - pip install transformers bitsandbytes peft wandb deepspeed - git clone https://github.com/huggingface/trl @@ -153,10 +152,10 @@ $ WANDB_API_KEY=... $ HUB_MODEL_ID=... $ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle - 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle - + # BACKEND RESOURCES INSTANCE TYPE PRICE + 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle + 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle + Submit the run trl-train-fsdp-distrib? [y/n]: y Provisioning... @@ -166,11 +165,10 @@ Provisioning... ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl){:target="_blank"}. !!! info "What's next?" 1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide - 2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), + 2. 
Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), [services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets) - diff --git a/examples/inference/nim/README.md b/examples/inference/nim/README.md index ba68018000..fe520e36bd 100644 --- a/examples/inference/nim/README.md +++ b/examples/inference/nim/README.md @@ -3,19 +3,18 @@ title: NVIDIA NIM description: "This example shows how to deploy DeepSeek-R1-Distill-Llama-8B to any cloud or on-premises environment using NVIDIA NIM and dstack." --- -# NVIDIA NIM +# NVIDIA NIM This example shows how to deploy DeepSeek-R1-Distill-Llama-8B using [NVIDIA NIM :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html){:target="_blank"} and `dstack`. ??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`. + Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -59,7 +58,7 @@ resources: ### Running a configuration -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. +To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
@@ -67,10 +66,10 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc $ NGC_API_KEY=... $ dstack apply -f examples/inference/nim/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - 2 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 - 3 vultr nrt 6xCPU, 60GB, 1xA100 (40GB) no $1.199 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 + 2 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199 + 3 vultr nrt 6xCPU, 60GB, 1xA100 (40GB) no $1.199 Submit the run serve-distill-deepseek? [y/n]: y @@ -79,7 +78,7 @@ Provisioning... ```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint +If no gateway is created, the model will be available via the OpenAI-compatible endpoint at `/proxy/models//`.
@@ -107,12 +106,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint 
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
 is available at `https://gateway./`.

 ## Source code

-The source-code of this example can be found in 
+The source code of this example can be found in
 [`examples/inference/nim` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/nim){:target="_blank"}.

 ## What's next?
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index f945e8db5d..f880ac30b7 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -3,14 +3,13 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} and `dstack`.

 ??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -19,7 +18,7 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGL

 Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.

 === "AMD"
-    
+
```yaml @@ -29,7 +28,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B image: lmsysorg/sglang:v0.4.1.post4-rocm620 env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - + commands: - python3 -m sglang.launch_server --model-path $MODEL_ID @@ -46,7 +45,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
=== "NVIDIA" - +
```yaml @@ -56,7 +55,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B image: lmsysorg/sglang:latest env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - + commands: - python3 -m sglang.launch_server --model-path $MODEL_ID @@ -81,9 +80,9 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc ```shell $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 - + # BACKEND REGION RESOURCES SPOT PRICE + 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 + Submit the run deepseek-r1-amd? [y/n]: y Provisioning... @@ -119,12 +118,12 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ ```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint 
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
 is available at `https://gateway./`.

 ## Source code

-The source-code of this example can be found in 
+The source code of this example can be found in
 [`examples/llms/deepseek/sglang` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang){:target="_blank"}.

 ## What's next?
diff --git a/examples/inference/tgi/README.md b/examples/inference/tgi/README.md
index 938154c24e..8630473dd9 100644
--- a/examples/inference/tgi/README.md
+++ b/examples/inference/tgi/README.md
@@ -8,14 +8,13 @@ description: "This example shows how to deploy Llama 4 Scout to any cloud or on-

 This example shows how to deploy Llama 4 Scout with `dstack` using [HuggingFace TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/index){:target="_blank"}.

 ??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -40,7 +39,7 @@ env: # max_batch_prefill_tokens must be >= max_input_tokens - MAX_BATCH_PREFILL_TOKENS=8192 commands: - # Activate the virtual environment at /usr/src/.venv/ + # Activate the virtual environment at /usr/src/.venv/ # as required by TGI's latest image. - . /usr/src/.venv/bin/activate - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher @@ -64,7 +63,7 @@ resources: ### Running a configuration -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. +To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
@@ -72,9 +71,9 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc $ HF_TOKEN=... $ dstack apply -f examples/inference/tgi/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87 - 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87 + 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98 Submit the run llama4-scout? [y/n]: y @@ -83,7 +82,7 @@ Provisioning... ```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint +If no gateway is created, the model will be available via the OpenAI-compatible endpoint at `/proxy/models//`.
@@ -111,12 +110,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint 
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
 is available at `https://gateway./`.

 ## Source code

-The source-code of this example can be found in 
+The source code of this example can be found in
 [`examples/inference/tgi` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi).

 ## What's next?
diff --git a/examples/inference/trtllm/README.md b/examples/inference/trtllm/README.md
index d84141a387..3d29ab0d91 100644
--- a/examples/inference/trtllm/README.md
+++ b/examples/inference/trtllm/README.md
@@ -9,14 +9,13 @@ This example shows how to deploy both DeepSeek R1 and its distilled version
 using [TensorRT-LLM :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM){:target="_blank"} and `dstack`.

 ??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -72,8 +71,8 @@ To run it, pass the task configuration to `dstack apply`. ```shell $ dstack apply -f examples/inference/trtllm/build-image.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073 + # BACKEND REGION RESOURCES SPOT PRICE + 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073 Submit the run build-image? [y/n]: y @@ -93,7 +92,7 @@ Below is the service configuration that deploys DeepSeek R1 using the built Tens name: serve-r1 # Specify the image built with `examples/inference/trtllm/build-image.dstack.yml` - image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167 + image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167 env: - MAX_BATCH_SIZE=256 - MAX_NUM_TOKENS=16384 @@ -125,15 +124,15 @@ Below is the service configuration that deploys DeepSeek R1 using the built Tens
-To run it, pass the configuration to `dstack apply`. +To run it, pass the configuration to `dstack apply`.
```shell $ dstack apply -f examples/inference/trtllm/serve-r1.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62 Submit the run serve-r1? [y/n]: y @@ -149,7 +148,7 @@ To deploy DeepSeek R1 Distill Llama 8B, follow the steps below. #### Convert and upload checkpoints -Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format +Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format and uploads it to S3 using the provided AWS credentials.
@@ -168,7 +167,7 @@ and uploads it to S3 using the provided AWS credentials. - AWS_DEFAULT_REGION commands: # nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0, - # therefore we are using branch v0.17.0 + # therefore we are using branch v0.17.0 - git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git - git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git - git clone https://github.com/triton-inference-server/server.git @@ -192,15 +191,15 @@ and uploads it to S3 using the provided AWS credentials.
-To run it, pass the configuration to `dstack apply`. +To run it, pass the configuration to `dstack apply`.
```shell $ dstack apply -f examples/inference/trtllm/convert-model.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 Submit the run convert-model? [y/n]: y @@ -228,7 +227,7 @@ Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 w - AWS_SECRET_ACCESS_KEY - AWS_DEFAULT_REGION - MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length - - MAX_INPUT_LEN=4096 + - MAX_INPUT_LEN=4096 - MAX_BATCH_SIZE=256 - TRITON_MAX_BATCH_SIZE=1 - INSTANCE_COUNT=1 @@ -260,15 +259,15 @@ Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 w ```
-To run it, pass the configuration to `dstack apply`. +To run it, pass the configuration to `dstack apply`.
```shell $ dstack apply -f examples/inference/trtllm/build-model.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 Submit the run build-model? [y/n]: y @@ -302,25 +301,25 @@ Below is the service configuration that deploys DeepSeek R1 Distill Llama 8B. - ./aws/install - aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16 - git clone https://github.com/triton-inference-server/server.git - - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000 + - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000 port: 8000 model: ensemble resources: gpu: A100:40GB - + ```
-To run it, pass the configuration to `dstack apply`. +To run it, pass the configuration to `dstack apply`.
```shell $ dstack apply -f examples/inference/trtllm/serve-distill.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904 Submit the run serve-distill? [y/n]: y @@ -331,7 +330,7 @@ Provisioning... ## Access the endpoint -If no gateway is created, the model will be available via the OpenAI-compatible endpoint +If no gateway is created, the model will be available via the OpenAI-compatible endpoint at `/proxy/models//`.
@@ -360,12 +359,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint 
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway./`.

## Source code

-The source-code of this example can be found in 
+The source code for this example can be found in
[`examples/inference/trtllm` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/trtllm){:target="_blank"}.

## What's next?
diff --git a/examples/inference/vllm/README.md b/examples/inference/vllm/README.md
index 57c6758301..d646ea2874 100644
--- a/examples/inference/vllm/README.md
+++ b/examples/inference/vllm/README.md
@@ -7,14 +7,13 @@
This example shows how to deploy Llama 3.1 8B with `dstack` using [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/){:target="_blank"}.

??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.

<div class="termy">
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -60,14 +59,14 @@ resources: ### Running a configuration -To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command. +To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
```shell $ dstack apply -f examples/inference/vllm/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE + # BACKEND REGION RESOURCES SPOT PRICE 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 @@ -79,7 +78,7 @@ Provisioning... ```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint +If no gateway is created, the model will be available via the OpenAI-compatible endpoint at `/proxy/models//`.
@@ -107,12 +106,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint 
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway./`.

## Source code

-The source-code of this example can be found in 
+The source code for this example can be found in
[`examples/inference/vllm` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm).

## What's next?
diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md
index b1390fd525..ac098fa70c 100644
--- a/examples/llms/deepseek/README.md
+++ b/examples/llms/deepseek/README.md
@@ -2,19 +2,18 @@
This example walks you through how to deploy and train
[Deepseek :material-arrow-top-right-thin:{ .external }](https://huggingface.co/deepseek-ai){:target="_blank"}
-models with `dstack`. 
+models with `dstack`.

> We used Deepseek-R1 distilled models and Deepseek-V2-Lite, a 16B model with the same architecture as Deepseek-R1 (671B).
Deepseek-V2-Lite retains MLA and DeepSeekMoE but requires less memory, making it ideal for testing and fine-tuning on smaller GPUs.

??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.

<div class="termy">
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -52,13 +51,13 @@ Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` usin
=== "vLLM" - +
```yaml type: service name: deepseek-r1-amd - + image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B @@ -68,7 +67,7 @@ Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` usin --max-model-len $MAX_MODEL_LEN --trust-remote-code port: 8000 - + model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B resources: @@ -83,7 +82,7 @@ Note, when using `Deepseek-R1-Distill-Llama-70B` with `vLLM` with a 192GB GPU, w Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` using [TGI on Gaudi :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/tgi-gaudi){:target="_blank"} -and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/HabanaAI/vllm-fork){:target="_blank"} (Gaudi fork) with Intel Gaudi 2. +and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/HabanaAI/vllm-fork){:target="_blank"} (Gaudi fork) with Intel Gaudi 2. > Both [TGI on Gaudi :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/tgi-gaudi){:target="_blank"} > and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/HabanaAI/vllm-fork){:target="_blank"} do not support `Deepseek-V2-Lite`. @@ -151,7 +150,7 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/Haban env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B - HABANA_VISIBLE_DEVICES=all - - OMPI_MCA_btl_vader_single_copy_mechanism=none + - OMPI_MCA_btl_vader_single_copy_mechanism=none commands: - git clone https://github.com/HabanaAI/vllm-fork.git @@ -166,13 +165,13 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/Haban port: 8000 ``` -
+
### NVIDIA Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-8B` using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} -and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} with NVIDIA GPUs. +and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} with NVIDIA GPUs. Both SGLang and vLLM also support `Deepseek-V2-Lite`. === "SGLang" @@ -181,7 +180,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`. ```yaml type: service name: deepseek-r1-nvidia - + image: lmsysorg/sglang:latest env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B @@ -190,10 +189,10 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`. --model-path $MODEL_ID --port 8000 --trust-remote-code - + port: 8000 model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - + resources: gpu: 24GB ``` @@ -205,17 +204,17 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`. ```yaml type: service name: deepseek-r1-nvidia - + image: vllm/vllm-openai:latest env: - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B - MAX_MODEL_LEN=4096 commands: - vllm serve $MODEL_ID - --max-model-len $MAX_MODEL_LEN - port: 8000 + --max-model-len $MAX_MODEL_LEN + port: 8000 model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - + resources: gpu: 24GB ``` @@ -253,9 +252,9 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc ```shell $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 - + # BACKEND REGION RESOURCES SPOT PRICE + 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49 + Submit the run deepseek-r1-amd? [y/n]: y Provisioning... @@ -291,7 +290,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \ ```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint +When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint is available at `https://gateway./`. ## Fine-tuning @@ -371,19 +370,21 @@ Here are the examples of LoRA fine-tuning of `Deepseek-V2-Lite` and GRPO fine-tu type: task name: trl-train-grpo - image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 + image: rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0 env: - WANDB_API_KEY - WANDB_PROJECT - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B + files: + - grpo_train.py commands: - pip install trl - pip install datasets # numPy version less than 2 is required for the scipy installation with AMD. - pip install "numpy<2" - mkdir -p grpo_example - - cp examples/llms/deepseek/trl/amd/grpo_train.py grpo_example/grpo_train.py + - cp grpo_train.py grpo_example/grpo_train.py - cd grpo_example - python grpo_train.py --model_name_or_path $MODEL_ID @@ -529,43 +530,43 @@ on NVIDIA GPU using HuggingFace's [TRL :material-arrow-top-right-thin:{ .externa - pip install bitsandbytes - cd peft/examples/sft - python train.py - --seed 100 - --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" - --dataset_name "smangrul/ultrachat-10k-chatml" - --chat_template_format "chatml" - --add_special_tokens False - --append_concat_token False - --splits "train,test" - --max_seq_len 512 - --num_train_epochs 1 - --logging_steps 5 - --log_level "info" - --logging_strategy "steps" - --eval_strategy "epoch" - --save_strategy "epoch" - --hub_private_repo True - --hub_strategy "every_save" - --bf16 True - --packing True - --learning_rate 1e-4 - --lr_scheduler_type "cosine" - --weight_decay 1e-4 - --warmup_ratio 0.0 - --max_grad_norm 1.0 - --output_dir "mistral-sft-lora" - --per_device_train_batch_size 8 - --per_device_eval_batch_size 8 - --gradient_accumulation_steps 4 - --gradient_checkpointing True - --use_reentrant 
True - --dataset_text_field "content" - --use_peft_lora True - --lora_r 16 - --lora_alpha 16 - --lora_dropout 0.05 - --lora_target_modules "all-linear" - --use_4bit_quantization True - --use_nested_quant True + --seed 100 + --model_name_or_path "deepseek-ai/DeepSeek-V2-Lite" + --dataset_name "smangrul/ultrachat-10k-chatml" + --chat_template_format "chatml" + --add_special_tokens False + --append_concat_token False + --splits "train,test" + --max_seq_len 512 + --num_train_epochs 1 + --logging_steps 5 + --log_level "info" + --logging_strategy "steps" + --eval_strategy "epoch" + --save_strategy "epoch" + --hub_private_repo True + --hub_strategy "every_save" + --bf16 True + --packing True + --learning_rate 1e-4 + --lr_scheduler_type "cosine" + --weight_decay 1e-4 + --warmup_ratio 0.0 + --max_grad_norm 1.0 + --output_dir "mistral-sft-lora" + --per_device_train_batch_size 8 + --per_device_eval_batch_size 8 + --gradient_accumulation_steps 4 + --gradient_checkpointing True + --use_reentrant True + --dataset_text_field "content" + --use_peft_lora True + --lora_r 16 + --lora_alpha 16 + --lora_dropout 0.05 + --lora_target_modules "all-linear" + --use_4bit_quantization True + --use_nested_quant True --bnb_4bit_compute_dtype "bfloat16" resources: @@ -598,10 +599,9 @@ needs 7–10GB due to intermediate hidden states. ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/llms/deepseek` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek). !!! info "What's next?" - 1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), + 1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). 
- diff --git a/examples/llms/deepseek/trl/amd/grpo.dstack.yml b/examples/llms/deepseek/trl/amd/grpo.dstack.yml index f866bb1ca0..c1a76e528b 100644 --- a/examples/llms/deepseek/trl/amd/grpo.dstack.yml +++ b/examples/llms/deepseek/trl/amd/grpo.dstack.yml @@ -9,15 +9,15 @@ env: - WANDB_API_KEY - WANDB_PROJECT - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B +# Mount files +files: + - grpo_train.py # Commands of the task commands: - pip install trl - pip install datasets # numpy version less than 2 is required for the scipy installation with AMD. - pip install "numpy<2" - - mkdir -p grpo_example - - cp examples/llms/deepseek/trl/amd/grpo_train.py grpo_example/grpo_train.py - - cd grpo_example - python grpo_train.py --model_name_or_path $MODEL_ID --dataset_name trl-lib/tldr diff --git a/examples/llms/llama/README.md b/examples/llms/llama/README.md index 7fe051c2f7..89e716d403 100644 --- a/examples/llms/llama/README.md +++ b/examples/llms/llama/README.md @@ -3,14 +3,13 @@ This example walks you through how to deploy Llama 4 Scout model with `dstack`. ??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`. + Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -18,9 +17,9 @@ This example walks you through how to deploy Llama 4 Scout model with `dstack`. ## Deployment ### AMD -Here's an example of a service that deploys -[`Llama-4-Scout-17B-16E-Instruct` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct){:target="_blank"} -using [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} +Here's an example of a service that deploys +[`Llama-4-Scout-17B-16E-Instruct` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct){:target="_blank"} +using [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} with AMD `MI300X` GPUs.
@@ -35,7 +34,7 @@ env: - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct - VLLM_WORKER_MULTIPROC_METHOD=spawn - VLLM_USE_MODELSCOPE=False - - VLLM_USE_TRITON_FLASH_ATTN=0 + - VLLM_USE_TRITON_FLASH_ATTN=0 - MAX_MODEL_LEN=256000 commands: @@ -47,7 +46,7 @@ commands: --max-num-seqs 64 \ --override-generation-config='{"attn_temperature_tuning": true}' - + port: 8000 # Register the model model: meta-llama/Llama-4-Scout-17B-16E-Instruct @@ -59,9 +58,9 @@ resources:
### NVIDIA -Here's an example of a service that deploys -[`Llama-4-Scout-17B-16E-Instruct` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct){:target="_blank"} -using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} +Here's an example of a service that deploys +[`Llama-4-Scout-17B-16E-Instruct` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct){:target="_blank"} +using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} with NVIDIA `H200` GPUs. === "SGLang" @@ -96,7 +95,7 @@ with NVIDIA `H200` GPUs.
=== "vLLM" - +
```yaml @@ -128,12 +127,12 @@ with NVIDIA `H200` GPUs.
!!! info "NOTE:" - With vLLM, add `--override-generation-config='{"attn_temperature_tuning": true}'` to + With vLLM, add `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for [contexts longer than 32K tokens :material-arrow-top-right-thin:{ .external }](https://blog.vllm.ai/2025/04/05/llama4.html){:target="_blank"}. ### Memory requirements -Below are the approximate memory requirements for loading the model. +Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations. | Model | Size | FP16 | FP8 | INT4 | @@ -153,11 +152,11 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc $ HF_TOKEN=... $ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87 - 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98 + # BACKEND REGION RESOURCES SPOT PRICE + 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87 + 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98 + - Submit the run llama4-scout? [y/n]: y Provisioning... @@ -195,7 +194,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint +When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint is available at `https://./`. [//]: # (TODO: https://github.com/dstackai/dstack/issues/1777) @@ -224,9 +223,9 @@ env: # Commands of the task commands: - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml - - axolotl train scout-qlora-fsdp1.yaml - --wandb-project $WANDB_PROJECT - --wandb-name $WANDB_NAME + - axolotl train scout-qlora-fsdp1.yaml + --wandb-project $WANDB_PROJECT + --wandb-name $WANDB_NAME --hub-model-id $HUB_MODEL_ID resources: @@ -242,7 +241,7 @@ The task uses Axolotl's Docker image, where Axolotl is already pre-installed. ### Memory requirements -Below are the approximate memory requirements for loading the model. +Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations. | Model | Size | Full fine-tuning | LoRA | QLoRA | @@ -279,11 +278,11 @@ $ dstack apply -f examples/single-node-training/axolotl/.dstack.yml ## Source code -The source-code for deployment examples can be found in +The source-code for deployment examples can be found in [`examples/llms/llama` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/llama) and the source-code for the finetuning example can be found in [`examples/single-node-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl){:target="_blank"}. ## What's next? -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), +1. 
Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
   [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
2. Browse [Llama 4 with SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang/blob/main/docs/references/llama4.md),
   [Llama 4 with vLLM :material-arrow-top-right-thin:{ .external }](https://blog.vllm.ai/2025/04/05/llama4.html),
   [Llama 4 with AMD :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/llama4-day-0-support/README.html)
   and [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/OpenAccess-AI-Collective/axolotl){:target="_blank"}.
diff --git a/examples/llms/llama31/README.md b/examples/llms/llama31/README.md
index 66bf686faf..bc07d74da9 100644
--- a/examples/llms/llama31/README.md
+++ b/examples/llms/llama31/README.md
@@ -3,14 +3,13 @@
This example walks you through how to deploy and fine-tune Llama 3.1 with `dstack`.

??? info "Prerequisites"
-    Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.

<div class="termy">
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -22,12 +21,12 @@ Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NI === "vLLM" -
+
```yaml type: service name: llama31 - + python: "3.11" env: - HF_TOKEN @@ -41,14 +40,14 @@ Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NI port: 8000 # Register the model model: meta-llama/Meta-Llama-3.1-8B-Instruct - + # Uncomment to leverage spot instances #spot_policy: auto # Uncomment to cache downloaded models #volumes: # - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub - + resources: gpu: 24GB # Uncomment if using multiple GPUs @@ -59,12 +58,12 @@ Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NI === "TGI" -
+
```yaml type: service name: llama31 - + image: ghcr.io/huggingface/text-generation-inference:latest env: - HF_TOKEN @@ -76,14 +75,14 @@ Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NI port: 80 # Register the model model: meta-llama/Meta-Llama-3.1-8B-Instruct - + # Uncomment to leverage spot instances #spot_policy: auto - # Uncomment to cache downloaded models + # Uncomment to cache downloaded models #volumes: # - /data:/data - + resources: gpu: 24GB # Uncomment if using multiple GPUs @@ -94,12 +93,12 @@ Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NI === "NIM" -
+
```yaml type: service name: llama31 - + image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest env: - NGC_API_KEY @@ -110,14 +109,14 @@ Here's an example of a service that deploys Llama 3.1 8B using vLLM, TGI, and NI port: 8000 # Register the model model: meta/llama-3.1-8b-instruct - + # Uncomment to leverage spot instances #spot_policy: auto - + # Cache downloaded models volumes: - /root/.cache/nim:/opt/nim/.cache - + resources: gpu: 24GB # Uncomment if using multiple GPUs @@ -130,7 +129,7 @@ Note, when using Llama 3.1 8B with a 24GB GPU, we must limit the context size to ### Memory requirements -Below are the approximate memory requirements for loading the model. +Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations. | Model size | FP16 | FP8 | INT4 | @@ -171,7 +170,7 @@ $ dstack apply -f examples/llms/llama31/vllm/.dstack.yml 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 - + Submit the run llama31? [y/n]: y Provisioning... @@ -208,7 +207,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint 
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway./`.

[//]: # (TODO: How to prompting and tool calling)
@@ -222,14 +221,14 @@ is available at `https://gateway./`.

Below is the task configuration file for fine-tuning Llama 3.1 8B using TRL on the
[`OpenAssistant/oasst_top1_2023-08-25` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25) dataset.

-<div class="termy">
+
```yaml type: task name: trl-train python: 3.12 -# Ensure nvcc is installed (req. for Flash Attention) +# Ensure nvcc is installed (req. for Flash Attention) nvcc: true env: - HF_TOKEN @@ -245,7 +244,7 @@ commands: - pip install . - accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml - --num_processes $DSTACK_GPUS_PER_NODE + --num_processes $DSTACK_GPUS_PER_NODE examples/scripts/sft.py --model_name meta-llama/Meta-Llama-3.1-8B --dataset_name OpenAssistant/oasst_top1_2023-08-25 @@ -278,7 +277,7 @@ shm_size: 24GB
-Change the `resources` property to specify more GPUs. 
+Change the `resources` property to specify more GPUs.

### Memory requirements

@@ -296,7 +295,7 @@ The requirements can be significantly reduced with certain optimizations.

For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.

-To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of 
+To do this, use the `examples/accelerate_configs/deepspeed_zero3.yaml` configuration file instead of
`examples/accelerate_configs/multi_gpu.yaml`.

### Running on multiple nodes

If the model doesn't fit into a single GPU, consider running a `dstack` task on multiple nodes.
Below is the corresponding task configuration file.

-<div class="termy">
+
```yaml type: task @@ -314,7 +313,7 @@ name: trl-train-distrib nodes: 2 python: "3.10" -# Ensure nvcc is installed (req. for Flash Attention) +# Ensure nvcc is installed (req. for Flash Attention) nvcc: true env: @@ -330,7 +329,7 @@ commands: - cd trl - pip install . - accelerate launch - --config_file=examples/accelerate_configs/fsdp_qlora.yaml + --config_file=examples/accelerate_configs/fsdp_qlora.yaml --main_process_ip=$DSTACK_MASTER_NODE_IP --main_process_port=8008 --machine_rank=$DSTACK_NODE_RANK @@ -374,14 +373,14 @@ resources: ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/llms/llama31` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/llama31) and [`examples/single-node-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl). ## What's next? -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), +1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). -2. Browse [Llama 3.1 on HuggingFace :material-arrow-top-right-thin:{ .external }](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f), - [HuggingFace's Llama recipes :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/huggingface-llama-recipes), - [Meta's Llama recipes :material-arrow-top-right-thin:{ .external }](https://github.com/meta-llama/llama-recipes) +2. 
Browse [Llama 3.1 on HuggingFace :material-arrow-top-right-thin:{ .external }](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f), + [HuggingFace's Llama recipes :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/huggingface-llama-recipes), + [Meta's Llama recipes :material-arrow-top-right-thin:{ .external }](https://github.com/meta-llama/llama-recipes) and [Llama Agentic System :material-arrow-top-right-thin:{ .external }](https://github.com/meta-llama/llama-agentic-system/). diff --git a/examples/llms/llama32/README.md b/examples/llms/llama32/README.md index 1a232eb7ba..484720ee25 100644 --- a/examples/llms/llama32/README.md +++ b/examples/llms/llama32/README.md @@ -3,14 +3,13 @@ This example walks you through how to deploy Llama 3.2 vision model with `dstack` using `vLLM`. ??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`. + Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -19,7 +18,7 @@ This example walks you through how to deploy Llama 3.2 vision model with `dstack Here's an example of a service that deploys Llama 3.2 11B using vLLM. -
+
```yaml type: service @@ -56,7 +55,7 @@ resources: ### Memory requirements -Below are the approximate memory requirements for loading the model. +Below are the approximate memory requirements for loading the model. This excludes memory for the model context and CUDA kernel reservations. | Model size | FP16 | @@ -76,12 +75,12 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc $ HF_TOKEN=... $ dstack apply -f examples/llms/llama32/vllm/.dstack.yml - # BACKEND REGION RESOURCES SPOT PRICE - 1 runpod CA-MTL-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 - 2 runpod EU-SE-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 - 3 runpod EU-SE-1 9xCPU, 50GB, 1xA6000 (48GB) yes $0.25 + # BACKEND REGION RESOURCES SPOT PRICE + 1 runpod CA-MTL-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 + 2 runpod EU-SE-1 9xCPU, 50GB, 1xA40 (48GB) yes $0.24 + 3 runpod EU-SE-1 9xCPU, 50GB, 1xA6000 (48GB) yes $0.25 + - Submit the run llama32? [y/n]: y Provisioning... @@ -115,19 +114,19 @@ $ curl http://127.0.0.1:3000/proxy/services/main/llama32/v1/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint +When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint is available at `https://./`. [//]: # (TODO: https://github.com/dstackai/dstack/issues/1777) ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/llms/llama32` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/llama32). ## What's next? -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), +1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). 2. Browse [Llama 3.2 on HuggingFace :material-arrow-top-right-thin:{ .external }](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf) and [LLama 3.2 on vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/models/supported_models.html#multimodal-language-models). diff --git a/examples/misc/airflow/README.md b/examples/misc/airflow/README.md index 13598d8e05..21687de743 100644 --- a/examples/misc/airflow/README.md +++ b/examples/misc/airflow/README.md @@ -29,8 +29,7 @@ def pipeline(...): return ( f"source {DSTACK_VENV_PATH}/bin/activate" f" && cd {DSTACK_REPO_PATH}" - " && dstack init" - " && dstack apply -y -f task.dstack.yml" + " && dstack apply -y -f task.dstack.yml --repo ." ) ``` @@ -78,5 +77,5 @@ def pipeline(...): ## Source code -The source code for this example can be found in +The source code for this example can be found in [`examples/misc/airflow` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/airflow). 
diff --git a/examples/misc/airflow/dags/dstack_tasks.py b/examples/misc/airflow/dags/dstack_tasks.py index 8328b83fc0..30741dbcbb 100644 --- a/examples/misc/airflow/dags/dstack_tasks.py +++ b/examples/misc/airflow/dags/dstack_tasks.py @@ -38,7 +38,7 @@ def dstack_cli_apply() -> str: dstack is installed into the main Airflow environment. NOT RECOMMENDED since dstack and Airflow may have conflicting dependencies. """ - return f"cd {DSTACK_REPO_PATH} && dstack init && dstack apply -y -f task.dstack.yml" + return f"cd {DSTACK_REPO_PATH} && dstack apply -y -f task.dstack.yml --repo ." @task.bash def dstack_cli_apply_venv() -> str: @@ -49,8 +49,7 @@ def dstack_cli_apply_venv() -> str: return ( f"source {DSTACK_VENV_PATH}/bin/activate" f" && cd {DSTACK_REPO_PATH}" - " && dstack init" - " && dstack apply -y -f task.dstack.yml" + " && dstack apply -y -f task.dstack.yml --repo ." ) @task.external_python(task_id="external_python", python=DSTACK_VENV_PYTHON_BINARY_PATH) diff --git a/examples/misc/docker-compose/.dstack.yml b/examples/misc/docker-compose/.dstack.yml index 2cf006bbf2..0967f72b81 100644 --- a/examples/misc/docker-compose/.dstack.yml +++ b/examples/misc/docker-compose/.dstack.yml @@ -6,6 +6,8 @@ env: - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - HF_TOKEN ide: vscode +files: + - compose.yaml # Uncomment to leverage spot instances #spot_policy: auto diff --git a/examples/misc/docker-compose/README.md b/examples/misc/docker-compose/README.md index dc2b4dec31..262f2abfdf 100644 --- a/examples/misc/docker-compose/README.md +++ b/examples/misc/docker-compose/README.md @@ -8,14 +8,13 @@ serving [Llama-3.2-3B-Instruct :material-arrow-top-right-thin:{ .external }](htt using [Docker Compose :material-arrow-top-right-thin:{ .external }](https://docs.docker.com/compose/){:target="_blank"}. ??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`. 
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
@@ -26,31 +25,32 @@ using [Docker Compose :material-arrow-top-right-thin:{ .external }](https://docs === "`task.dstack.yml`" -
- +
+ ```yaml type: task name: chat-ui-task - + docker: true env: - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - HF_TOKEN - working_dir: examples/misc/docker-compose + files: + - compose.yaml commands: - docker compose up ports: - 9000 - + resources: gpu: "nvidia:24GB" ``` - +
=== "`compose.yaml`" -
+
```yaml services: @@ -71,7 +71,7 @@ using [Docker Compose :material-arrow-top-right-thin:{ .external }](https://docs depends_on: - tgi - db - + tgi: image: ghcr.io/huggingface/text-generation-inference:sha-704a58c volumes: @@ -87,12 +87,12 @@ using [Docker Compose :material-arrow-top-right-thin:{ .external }](https://docs - driver: nvidia count: all capabilities: [gpu] - + db: image: mongo:latest volumes: - db_data:/data/db - + volumes: tgi_data: db_data: @@ -119,7 +119,7 @@ $ dstack apply -f examples/examples/misc/docker-compose/task.dstack.yml 1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12 2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12 3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23 - + Submit the run chat-ui-task? [y/n]: y Provisioning... @@ -133,7 +133,7 @@ Provisioning... To persist data between runs, create a [volume](https://dstack.ai/docs/concepts/volumes/) and attach it to the run configuration. -
+
```yaml type: task @@ -144,7 +144,8 @@ image: dstackai/dind env: - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - HF_TOKEN -working_dir: examples/misc/docker-compose +files: + - compose.yaml commands: - start-dockerd - docker compose up @@ -170,10 +171,10 @@ be persisted. ## Source code -The source-code of this example can be found in +The source-code of this example can be found in [`examples/misc/docker-compose` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/docker-compose). ## What's next? -1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), +1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips). diff --git a/examples/misc/docker-compose/service.dstack.yml b/examples/misc/docker-compose/service.dstack.yml index 7234ce1b64..b33b900fbd 100644 --- a/examples/misc/docker-compose/service.dstack.yml +++ b/examples/misc/docker-compose/service.dstack.yml @@ -5,7 +5,8 @@ docker: true env: - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - HF_TOKEN -working_dir: examples/misc/docker-compose +files: + - compose.yaml commands: - docker compose up port: 9000 diff --git a/examples/misc/docker-compose/task.dstack.yml b/examples/misc/docker-compose/task.dstack.yml index 148b6a11dc..e7af43f383 100644 --- a/examples/misc/docker-compose/task.dstack.yml +++ b/examples/misc/docker-compose/task.dstack.yml @@ -5,7 +5,8 @@ docker: true env: - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - HF_TOKEN -working_dir: examples/misc/docker-compose +files: + - compose.yaml commands: - docker compose up ports: diff --git a/examples/single-node-training/axolotl/README.md b/examples/single-node-training/axolotl/README.md index bceee1a488..e99be93de0 100644 --- a/examples/single-node-training/axolotl/README.md +++ 
b/examples/single-node-training/axolotl/README.md @@ -3,21 +3,20 @@ This example shows how to use [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/OpenAccess-AI-Collective/axolotl){:target="_blank"} with `dstack` to fine-tune 4-bit Quantized `Llama-4-Scout-17B-16E` using SFT with FSDP and QLoRA. ??? info "Prerequisites" - Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`. + Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell $ git clone https://github.com/dstackai/dstack $ cd dstack - $ dstack init ```
## Define a configuration -Axolotl reads the model, QLoRA, and dataset arguments, as well as trainer configuration from a [`scout-qlora-fsdp1.yaml` :material-arrow-top-right-thin:{ .external }](https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-4/scout-qlora-fsdp1.yaml){:target="_blank"} file. The configuration uses 4-bit axolotl quantized version of `meta-llama/Llama-4-Scout-17B-16E`, requiring only ~43GB VRAM/GPU with 4K context length. +Axolotl reads the model, QLoRA, and dataset arguments, as well as trainer configuration from a [`scout-qlora-flexattn-fsdp2.yaml` :material-arrow-top-right-thin:{ .external }](https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-4/scout-qlora-flexattn-fsdp2.yaml){:target="_blank"} file. The configuration uses 4-bit axolotl quantized version of `meta-llama/Llama-4-Scout-17B-16E`, requiring only ~43GB VRAM/GPU with 4K context length. Below is a task configuration that does fine-tuning. @@ -37,12 +36,11 @@ env: - WANDB_API_KEY - WANDB_PROJECT - HUB_MODEL_ID - - DSTACK_RUN_NAME # Commands of the task commands: - - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml + - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-flexattn-fsdp2.yaml - | - axolotl train scout-qlora-fsdp1.yaml \ + axolotl train scout-qlora-flexattn-fsdp2.yaml \ --wandb-project $WANDB_PROJECT \ --wandb-name $DSTACK_RUN_NAME \ --hub-model-id $HUB_MODEL_ID @@ -76,9 +74,9 @@ $ WANDB_PROJECT=... $ HUB_MODEL_ID=... 
$ dstack apply -f examples/single-node-training/axolotl/.dstack.yml - # BACKEND RESOURCES INSTANCE TYPE PRICE - 1 vastai (cz-czechia) cpu=64 mem=128GB H100:80GB:2 18794506 $3.8907 - 2 vastai (us-texas) cpu=52 mem=64GB H100:80GB:2 20442365 $3.6926 + # BACKEND RESOURCES INSTANCE TYPE PRICE + 1 vastai (cz-czechia) cpu=64 mem=128GB H100:80GB:2 18794506 $3.8907 + 2 vastai (us-texas) cpu=52 mem=64GB H100:80GB:2 20442365 $3.6926 3 vastai (fr-france) cpu=64 mem=96GB H100:80GB:2 20379984 $3.7389 Submit the run axolotl-nvidia-llama-scout-train? [y/n]: @@ -97,6 +95,6 @@ The source-code of this example can be found in ## What's next? 1. Browse the [Axolotl distributed training](https://dstack.ai/docs/examples/distributed-training/axolotl) example -2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), +2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), [services](https://dstack.ai/docs/services), [fleets](https://dstack.ai/docs/concepts/fleets) 3. 
See the [AMD](https://dstack.ai/examples/accelerators/amd#axolotl) example diff --git a/examples/single-node-training/axolotl/config.yaml b/examples/single-node-training/axolotl/config.yaml deleted file mode 100644 index 57b7935f5b..0000000000 --- a/examples/single-node-training/axolotl/config.yaml +++ /dev/null @@ -1,82 +0,0 @@ -base_model: meta-llama/Meta-Llama-3-8B -model_type: LlamaForCausalLM -tokenizer_type: AutoTokenizer # PreTrainedTokenizerFast - -load_in_8bit: false -load_in_4bit: true -strict: false - -datasets: - - path: tatsu-lab/alpaca - type: alpaca -dataset_prepared_path: last_run_prepared -val_set_size: 0.05 -output_dir: ./out/qlora-llama3-8B - -adapter: qlora -lora_model_dir: - -sequence_len: 512 -sample_packing: false -pad_to_sequence_len: true - -lora_r: 8 -lora_alpha: 16 -lora_dropout: 0.05 -lora_target_modules: -lora_target_linear: true -lora_fan_in_fan_out: - -wandb_project: dstack+axolotl -wandb_entity: -wandb_watch: -wandb_name: llama3-8b-fp16-fsdp+qlora -wandb_log_model: - -gradient_accumulation_steps: 4 -micro_batch_size: 1 -num_epochs: 4 -optimizer: adamw_torch -lr_scheduler: cosine -learning_rate: 0.00001 - -train_on_inputs: false -group_by_length: false -bf16: auto -fp16: -tf32: false - -gradient_checkpointing: true -gradient_checkpointing_kwargs: - use_reentrant: true -early_stopping_patience: -resume_from_checkpoint: -local_rank: -logging_steps: 1 -xformers_attention: -flash_attention: true - -warmup_steps: 10 -evals_per_epoch: 4 -eval_table_size: -saves_per_epoch: 1 -debug: -deepspeed: -weight_decay: 0.0 -fsdp: - - full_shard - - auto_wrap -fsdp_config: - fsdp_limit_all_gathers: true - fsdp_sync_module_states: true - fsdp_offload_params: true - fsdp_use_orig_params: false - fsdp_cpu_ram_efficient_loading: true - fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP - fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer - fsdp_state_dict_type: FULL_STATE_DICT - fsdp_sharding_strategy: FULL_SHARD -special_tokens: - pad_token: <|end_of_text|> 
- -hub_model_id: peterschmidt85/axolotl_llama3_8b_fsdp_qlora diff --git a/examples/single-node-training/optimum-tpu/llama31/.dstack.yml b/examples/single-node-training/optimum-tpu/llama31/.dstack.yml index 482840c209..c93862e678 100644 --- a/examples/single-node-training/optimum-tpu/llama31/.dstack.yml +++ b/examples/single-node-training/optimum-tpu/llama31/.dstack.yml @@ -8,12 +8,17 @@ python: "3.11" env: - HF_TOKEN +# Mount files +files: + - train.py + - config.yaml + # Commands of the task commands: - git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git - mkdir -p optimum-tpu/examples/custom/ - - cp examples/single-node-training/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py - - cp examples/single-node-training/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml + - cp train.py optimum-tpu/examples/custom/train.py + - cp config.yaml optimum-tpu/examples/custom/config.yaml - cd optimum-tpu - pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html - pip install datasets evaluate diff --git a/examples/single-node-training/qlora/.dstack.yml b/examples/single-node-training/qlora/.dstack.yml index 090912e49f..7d87f630aa 100644 --- a/examples/single-node-training/qlora/.dstack.yml +++ b/examples/single-node-training/qlora/.dstack.yml @@ -6,10 +6,14 @@ env: - HF_TOKEN - HF_HUB_ENABLE_HF_TRANSFER=1 +files: + - requirements.txt + - train.py + commands: - - pip install -r examples/single-node-training/qlora/requirements.txt + - pip install -r requirements.txt - tensorboard --logdir results/runs & - - python examples/single-node-training/qlora/train.py --merge_and_push ${{ run.args }} + - python train.py --merge_and_push ${{ run.args }} ports: - 6006 diff --git a/examples/single-node-training/trl/amd/.dstack.yml b/examples/single-node-training/trl/amd/.dstack.yml index ecc3845f9a..8e6baad788 100644 --- a/examples/single-node-training/trl/amd/.dstack.yml +++ 
b/examples/single-node-training/trl/amd/.dstack.yml @@ -9,6 +9,9 @@ image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04 env: - HF_TOKEN +files: + - train.py + commands: - export PATH=/opt/conda/envs/py_3.10/bin:$PATH - git clone https://github.com/ROCm/bitsandbytes @@ -22,7 +25,7 @@ commands: - pip install peft - pip install transformers datasets huggingface-hub scipy - cd .. - - python examples/single-node-training/trl/amd/train.py + - python train.py # Uncomment to leverage spot instances #spot_policy: auto