+
```yaml
type: service
name: llama31-service-vllm-tpu
@@ -79,17 +79,17 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-
- pip install -r requirements-tpu.txt
- apt-get install -y libopenblas-base libopenmpi-dev libomp-dev
- python setup.py develop
- - vllm serve $MODEL_ID
- --tensor-parallel-size 4
+ - vllm serve $MODEL_ID
+ --tensor-parallel-size 4
--max-model-len $MAX_MODEL_LEN
--port 8000
port: 8000
# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct
-
+
# Uncomment to leverage spot instances
#spot_policy: auto
-
+
resources:
gpu: v5litepod-4
```
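Once the service above is deployed, it can be queried through dstack's OpenAI-compatible proxy. Below is a minimal sketch using only the standard library; the server URL (`http://127.0.0.1:3000`) and project name (`main`) mirror the curl examples elsewhere in these docs and are assumptions to replace with your own.

```python
import json
from urllib import request

def build_chat_request(server_url: str, project: str, model: str, prompt: str) -> request.Request:
    """Build a POST against dstack's OpenAI-compatible proxy endpoint."""
    url = f"{server_url}/proxy/models/{project}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    return request.Request(url, data=body, headers={"Content-Type": "application/json"})

if __name__ == "__main__":
    # Hypothetical values -- replace with your dstack server URL and project.
    req = build_chat_request("http://127.0.0.1:3000", "main",
                             "meta-llama/Meta-Llama-3.1-8B-Instruct",
                             "What is a TPU?")
    # Requires the service to be up:
    # with request.urlopen(req) as resp:
    #     print(json.load(resp))
    print(req.full_url)
```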
@@ -123,11 +123,11 @@ cloud resources and run the configuration.
## Fine-tuning with Optimum TPU
-Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"}
+Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"}
and the [`Abirate/english_quotes` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/Abirate/english_quotes){:target="_blank"}
dataset.
-
+
```yaml
type: task
@@ -136,11 +136,14 @@ name: optimum-tpu-llama-train
python: "3.11"
env:
- HF_TOKEN
+files:
+ - train.py
+ - config.yaml
commands:
- git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git
- mkdir -p optimum-tpu/examples/custom/
- - cp examples/single-node-training/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py
- - cp examples/single-node-training/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml
+ - cp train.py optimum-tpu/examples/custom/train.py
+ - cp config.yaml optimum-tpu/examples/custom/config.yaml
- cd optimum-tpu
- pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html
- pip install datasets evaluate
@@ -178,7 +181,7 @@ Note, `v5litepod` is optimized for fine-tuning transformer-based models. Each co
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/inference/tgi/tpu` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/tpu){:target="_blank"},
[`examples/inference/vllm/tpu` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/tpu){:target="_blank"},
and [`examples/single-node-training/optimum-tpu` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/optimum-tpu){:target="_blank"}.
@@ -188,5 +191,5 @@ and [`examples/single-node-training/optimum-tpu` :material-arrow-top-right-thin:
1. Browse [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu),
[Optimum TPU TGI :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) and
[vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html).
-2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
+2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
[services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/concepts/fleets).
diff --git a/examples/distributed-training/axolotl/README.md b/examples/distributed-training/axolotl/README.md
index dd4b7cdb04..17efaf1e1a 100644
--- a/examples/distributed-training/axolotl/README.md
+++ b/examples/distributed-training/axolotl/README.md
@@ -3,14 +3,13 @@
This example walks you through how to run distributed fine-tuning using [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/axolotl-ai-cloud/axolotl){:target="_blank"} with `dstack`.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -67,7 +66,7 @@ commands:
--machine_rank=$DSTACK_NODE_RANK \
--num_processes=$DSTACK_GPUS_NUM \
--num_machines=$DSTACK_NODES_NUM
-
+
resources:
gpu: 80GB:8
shm_size: 128GB
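The `accelerate launch` flags above are driven entirely by the `DSTACK_*` environment variables that dstack injects on each node. A minimal sketch of the rank arithmetic this implies (illustration only; the helper name and env-var fallbacks are not part of dstack's API):

```python
import os

def node_worker_ranks(node_rank: int, num_nodes: int, total_gpus: int) -> list:
    """Global ranks of the workers accelerate starts on one node.

    Mirrors the flags above: --machine_rank=$DSTACK_NODE_RANK,
    --num_processes=$DSTACK_GPUS_NUM, --num_machines=$DSTACK_NODES_NUM.
    """
    gpus_per_node = total_gpus // num_nodes
    return [node_rank * gpus_per_node + g for g in range(gpus_per_node)]

if __name__ == "__main__":
    # dstack sets these on every node of a distributed task; the
    # fallbacks here are for local illustration only.
    rank = int(os.environ.get("DSTACK_NODE_RANK", "0"))
    nodes = int(os.environ.get("DSTACK_NODES_NUM", "2"))
    gpus = int(os.environ.get("DSTACK_GPUS_NUM", "16"))
    print(node_worker_ranks(rank, nodes, gpus))
```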
@@ -93,10 +92,10 @@ $ WANDB_PROJECT=...
$ HUB_MODEL_ID=...
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
- # BACKEND RESOURCES INSTANCE TYPE PRICE
- 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
- 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
-
+ # BACKEND RESOURCES INSTANCE TYPE PRICE
+ 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
+ 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
+
Submit the run trl-train-fsdp-distrib? [y/n]: y
Provisioning...
@@ -106,10 +105,10 @@ Provisioning...
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/distributed-training/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/axolotl).
!!! info "What's next?"
1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide
- 2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
+ 2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
[services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)
diff --git a/examples/distributed-training/trl/README.md b/examples/distributed-training/trl/README.md
index 7ac67047e8..3e3977c89e 100644
--- a/examples/distributed-training/trl/README.md
+++ b/examples/distributed-training/trl/README.md
@@ -3,14 +3,13 @@
This example walks you through how to run distributed fine-tuning using [TRL :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/trl){:target="_blank"}, [Accelerate :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/accelerate){:target="_blank"} and [DeepSpeed :material-arrow-top-right-thin:{ .external }](https://github.com/deepspeedai/DeepSpeed){:target="_blank"}.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -41,7 +40,7 @@ Once the fleet is created, define a distributed task configuration. Here's an ex
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
-
+
commands:
- pip install transformers bitsandbytes peft wandb
- git clone https://github.com/huggingface/trl
@@ -98,7 +97,7 @@ Once the fleet is created, define a distributed task configuration. Here's an ex
- HUB_MODEL_ID
- MODEL_ID=meta-llama/Llama-3.1-8B
- ACCELERATE_LOG_LEVEL=info
-
+
commands:
- pip install transformers bitsandbytes peft wandb deepspeed
- git clone https://github.com/huggingface/trl
@@ -153,10 +152,10 @@ $ WANDB_API_KEY=...
$ HUB_MODEL_ID=...
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
- # BACKEND RESOURCES INSTANCE TYPE PRICE
- 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
- 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
-
+ # BACKEND RESOURCES INSTANCE TYPE PRICE
+ 1 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
+ 2 ssh (remote) cpu=208 mem=1772GB H100:80GB:8 instance $0 idle
+
Submit the run trl-train-fsdp-distrib? [y/n]: y
Provisioning...
@@ -166,11 +165,10 @@ Provisioning...
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl){:target="_blank"}.
!!! info "What's next?"
1. Read the [clusters](https://dstack.ai/docs/guides/clusters) guide
- 2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
+ 2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
[services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)
-
diff --git a/examples/inference/nim/README.md b/examples/inference/nim/README.md
index ba68018000..fe520e36bd 100644
--- a/examples/inference/nim/README.md
+++ b/examples/inference/nim/README.md
@@ -3,19 +3,18 @@ title: NVIDIA NIM
description: "This example shows how to deploy DeepSeek-R1-Distill-Llama-8B to any cloud or on-premises environment using NVIDIA NIM and dstack."
---
-# NVIDIA NIM
+# NVIDIA NIM
This example shows how to deploy DeepSeek-R1-Distill-Llama-8B using [NVIDIA NIM :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html){:target="_blank"} and `dstack`.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -59,7 +58,7 @@ resources:
### Running a configuration
-To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
+To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
@@ -67,10 +66,10 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
$ NGC_API_KEY=...
$ dstack apply -f examples/inference/nim/.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199
- 2 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199
- 3 vultr nrt 6xCPU, 60GB, 1xA100 (40GB) no $1.199
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199
+ 2 vultr ewr 6xCPU, 60GB, 1xA100 (40GB) no $1.199
+ 3 vultr nrt 6xCPU, 60GB, 1xA100 (40GB) no $1.199
Submit the run serve-distill-deepseek? [y/n]: y
@@ -79,7 +78,7 @@ Provisioning...
```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
+If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
@@ -107,12 +106,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
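The two endpoint shapes described above can be captured in one small helper; a sketch under the assumption that the placeholder names (server URL, project, gateway domain) match your deployment:

```python
from typing import Optional

def model_endpoint(server_url: str, project: str, gateway_domain: Optional[str] = None) -> str:
    """Base URL of the OpenAI-compatible endpoint.

    Without a gateway, requests go through the dstack server's proxy;
    with a gateway, they go to the gateway's domain.
    """
    if gateway_domain is not None:
        return f"https://gateway.{gateway_domain}/"
    return f"{server_url}/proxy/models/{project}/"

if __name__ == "__main__":
    print(model_endpoint("http://127.0.0.1:3000", "main"))
```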
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/inference/nim` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/nim){:target="_blank"}.
## What's next?
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
index f945e8db5d..f880ac30b7 100644
--- a/examples/inference/sglang/README.md
+++ b/examples/inference/sglang/README.md
@@ -3,14 +3,13 @@
This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} and `dstack`.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -19,7 +18,7 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGL
Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
=== "AMD"
-
+
```yaml
@@ -29,7 +28,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
image: lmsysorg/sglang:v0.4.1.post4-rocm620
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
-
+
commands:
- python3 -m sglang.launch_server
--model-path $MODEL_ID
@@ -46,7 +45,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
=== "NVIDIA"
-
+
```yaml
@@ -56,7 +55,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
image: lmsysorg/sglang:latest
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-
+
commands:
- python3 -m sglang.launch_server
--model-path $MODEL_ID
@@ -81,9 +80,9 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
```shell
$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
-
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
+
Submit the run deepseek-r1-amd? [y/n]: y
Provisioning...
@@ -119,12 +118,12 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/llms/deepseek/sglang` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang){:target="_blank"}.
## What's next?
diff --git a/examples/inference/tgi/README.md b/examples/inference/tgi/README.md
index 938154c24e..8630473dd9 100644
--- a/examples/inference/tgi/README.md
+++ b/examples/inference/tgi/README.md
@@ -8,14 +8,13 @@ description: "This example shows how to deploy Llama 4 Scout to any cloud or on-
This example shows how to deploy Llama 4 Scout with `dstack` using [HuggingFace TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/index){:target="_blank"}.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -40,7 +39,7 @@ env:
# max_batch_prefill_tokens must be >= max_input_tokens
- MAX_BATCH_PREFILL_TOKENS=8192
commands:
- # Activate the virtual environment at /usr/src/.venv/
+ # Activate the virtual environment at /usr/src/.venv/
# as required by TGI's latest image.
- . /usr/src/.venv/bin/activate
- NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
@@ -64,7 +63,7 @@ resources:
### Running a configuration
-To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
+To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
@@ -72,9 +71,9 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
$ HF_TOKEN=...
$ dstack apply -f examples/inference/tgi/.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87
- 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 vastai is-iceland 48xCPU, 128GB, 2xH200 (140GB) no $7.87
+ 2 runpod EU-SE-1 40xCPU, 128GB, 2xH200 (140GB) no $7.98
Submit the run llama4-scout? [y/n]: y
@@ -83,7 +82,7 @@ Provisioning...
```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
+If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
@@ -111,12 +110,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/inference/tgi` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi).
## What's next?
diff --git a/examples/inference/trtllm/README.md b/examples/inference/trtllm/README.md
index d84141a387..3d29ab0d91 100644
--- a/examples/inference/trtllm/README.md
+++ b/examples/inference/trtllm/README.md
@@ -9,14 +9,13 @@ This example shows how to deploy both DeepSeek R1 and its distilled version
using [TensorRT-LLM :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM){:target="_blank"} and `dstack`.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -72,8 +71,8 @@ To run it, pass the task configuration to `dstack apply`.
```shell
$ dstack apply -f examples/inference/trtllm/build-image.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 cudo ca-montreal-2 8xCPU, 25GB, (500.0GB) yes $0.1073
Submit the run build-image? [y/n]: y
@@ -93,7 +92,7 @@ Below is the service configuration that deploys DeepSeek R1 using the built Tens
name: serve-r1
# Specify the image built with `examples/inference/trtllm/build-image.dstack.yml`
- image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167
+ image: dstackai/tensorrt_llm:9b931c0f6305aefa3660e6fb84a76a42c0eef167
env:
- MAX_BATCH_SIZE=256
- MAX_NUM_TOKENS=16384
@@ -125,15 +124,15 @@ Below is the service configuration that deploys DeepSeek R1 using the built Tens
-To run it, pass the configuration to `dstack apply`.
+To run it, pass the configuration to `dstack apply`.
```shell
$ dstack apply -f examples/inference/trtllm/serve-r1.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 vastai is-iceland 192xCPU, 2063GB, 8xH200 (141GB) yes $25.62
Submit the run serve-r1? [y/n]: y
@@ -149,7 +148,7 @@ To deploy DeepSeek R1 Distill Llama 8B, follow the steps below.
#### Convert and upload checkpoints
-Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format
+Here’s the task config that converts a Hugging Face model to a TensorRT-LLM checkpoint format
and uploads it to S3 using the provided AWS credentials.
@@ -168,7 +167,7 @@ and uploads it to S3 using the provided AWS credentials.
- AWS_DEFAULT_REGION
commands:
# nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 container uses TensorRT-LLM version 0.17.0,
- # therefore we are using branch v0.17.0
+ # therefore we are using branch v0.17.0
- git clone --branch v0.17.0 --depth 1 https://github.com/triton-inference-server/tensorrtllm_backend.git
- git clone --branch v0.17.0 --single-branch https://github.com/NVIDIA/TensorRT-LLM.git
- git clone https://github.com/triton-inference-server/server.git
@@ -192,15 +191,15 @@ and uploads it to S3 using the provided AWS credentials.
-To run it, pass the configuration to `dstack apply`.
+To run it, pass the configuration to `dstack apply`.
```shell
$ dstack apply -f examples/inference/trtllm/convert-model.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
Submit the run convert-model? [y/n]: y
@@ -228,7 +227,7 @@ Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 w
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
- MAX_SEQ_LEN=8192 # Sum of Max Input Length & Max Output Length
- - MAX_INPUT_LEN=4096
+ - MAX_INPUT_LEN=4096
- MAX_BATCH_SIZE=256
- TRITON_MAX_BATCH_SIZE=1
- INSTANCE_COUNT=1
@@ -260,15 +259,15 @@ Here’s the task config that builds a TensorRT-LLM model and uploads it to S3 w
```
-To run it, pass the configuration to `dstack apply`.
+To run it, pass the configuration to `dstack apply`.
```shell
$ dstack apply -f examples/inference/trtllm/build-model.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
Submit the run build-model? [y/n]: y
@@ -302,25 +301,25 @@ Below is the service configuration that deploys DeepSeek R1 Distill Llama 8B.
- ./aws/install
- aws s3 sync s3://${S3_BUCKET_NAME}/tllm_engine_1gpu_bf16 ./tllm_engine_1gpu_bf16
- git clone https://github.com/triton-inference-server/server.git
- - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000
+ - python3 server/python/openai/openai_frontend/main.py --model-repository s3://${S3_BUCKET_NAME}/triton_model_repo --tokenizer tokenizer_dir --openai-port 8000
port: 8000
model: ensemble
resources:
gpu: A100:40GB
-
+
```
-To run it, pass the configuration to `dstack apply`.
+To run it, pass the configuration to `dstack apply`.
```shell
$ dstack apply -f examples/inference/trtllm/serve-distill.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 vastai us-iowa 12xCPU, 85GB, 1xA100 (40GB) yes $0.66904
Submit the run serve-distill? [y/n]: y
@@ -331,7 +330,7 @@ Provisioning...
## Access the endpoint
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
+If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
@@ -360,12 +359,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
## Source code
+The source code of this example can be found in
+The source-code of this example can be found in
[`examples/inference/trtllm` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/trtllm){:target="_blank"}.
## What's next?
diff --git a/examples/inference/vllm/README.md b/examples/inference/vllm/README.md
index 57c6758301..d646ea2874 100644
--- a/examples/inference/vllm/README.md
+++ b/examples/inference/vllm/README.md
@@ -7,14 +7,13 @@ description: "This example shows how to deploy Llama 3.1 to any cloud or on-prem
This example shows how to deploy Llama 3.1 8B with `dstack` using [vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/){:target="_blank"}.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -60,14 +59,14 @@ resources:
### Running a configuration
-To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
+To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.
```shell
$ dstack apply -f examples/inference/vllm/.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
+ # BACKEND REGION RESOURCES SPOT PRICE
1 runpod CA-MTL-1 18xCPU, 100GB, A5000:24GB yes $0.12
2 runpod EU-SE-1 18xCPU, 100GB, A5000:24GB yes $0.12
3 gcp us-west4 27xCPU, 150GB, A5000:24GB:2 yes $0.23
@@ -79,7 +78,7 @@ Provisioning...
```
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
+If no gateway is created, the model will be available via the OpenAI-compatible endpoint
at `<dstack server URL>/proxy/models/<project name>/`.
@@ -107,12 +106,12 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.<gateway domain>/`.
## Source code
-The source-code of this example can be found in
+The source code of this example can be found in
[`examples/inference/vllm` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm).
## What's next?
diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md
index b1390fd525..ac098fa70c 100644
--- a/examples/llms/deepseek/README.md
+++ b/examples/llms/deepseek/README.md
@@ -2,19 +2,18 @@
This example walks you through how to deploy and
train [Deepseek :material-arrow-top-right-thin:{ .external }](https://huggingface.co/deepseek-ai){:target="_blank"}
-models with `dstack`.
+models with `dstack`.
> We used Deepseek-R1 distilled models and Deepseek-V2-Lite, a 16B model with the same architecture as Deepseek-R1 (671B). Deepseek-V2-Lite retains MLA and DeepSeekMoE but requires less memory, making it ideal for testing and fine-tuning on smaller GPUs.
??? info "Prerequisites"
- Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
+ Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo with examples.
```shell
$ git clone https://github.com/dstackai/dstack
$ cd dstack
- $ dstack init
```
@@ -52,13 +51,13 @@ Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` usin
=== "vLLM"
-
+
```yaml
type: service
name: deepseek-r1-amd
-
+
image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
@@ -68,7 +67,7 @@ Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B` usin
--max-model-len $MAX_MODEL_LEN
--trust-remote-code
port: 8000
-
+
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
resources:
@@ -83,7 +82,7 @@ Note, when using `Deepseek-R1-Distill-Llama-70B` with `vLLM` with a 192GB GPU, w
Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-70B`
using [TGI on Gaudi :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/tgi-gaudi){:target="_blank"}
-and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/HabanaAI/vllm-fork){:target="_blank"} (Gaudi fork) with Intel Gaudi 2.
+and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/HabanaAI/vllm-fork){:target="_blank"} (Gaudi fork) with Intel Gaudi 2.
> Neither [TGI on Gaudi :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/tgi-gaudi){:target="_blank"}
> nor [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/HabanaAI/vllm-fork){:target="_blank"} supports `Deepseek-V2-Lite`.
@@ -151,7 +150,7 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/Haban
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- HABANA_VISIBLE_DEVICES=all
- - OMPI_MCA_btl_vader_single_copy_mechanism=none
+ - OMPI_MCA_btl_vader_single_copy_mechanism=none
commands:
- git clone https://github.com/HabanaAI/vllm-fork.git
@@ -166,13 +165,13 @@ and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/Haban
port: 8000
```
-
+
### NVIDIA
Here's an example of a service that deploys `Deepseek-R1-Distill-Llama-8B`
using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"}
-and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} with NVIDIA GPUs.
+and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"} with NVIDIA GPUs.
Both SGLang and vLLM also support `Deepseek-V2-Lite`.
=== "SGLang"
@@ -181,7 +180,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
```yaml
type: service
name: deepseek-r1-nvidia
-
+
image: lmsysorg/sglang:latest
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
@@ -190,10 +189,10 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
--model-path $MODEL_ID
--port 8000
--trust-remote-code
-
+
port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-
+
resources:
gpu: 24GB
```
@@ -205,17 +204,17 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
```yaml
type: service
name: deepseek-r1-nvidia
-
+
image: vllm/vllm-openai:latest
env:
- MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- MAX_MODEL_LEN=4096
commands:
- vllm serve $MODEL_ID
- --max-model-len $MAX_MODEL_LEN
- port: 8000
+ --max-model-len $MAX_MODEL_LEN
+ port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-
+
resources:
gpu: 24GB
```
@@ -253,9 +252,9 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
```shell
$ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
- # BACKEND REGION RESOURCES SPOT PRICE
- 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
-
+ # BACKEND REGION RESOURCES SPOT PRICE
+ 1 runpod EU-RO-1 24xCPU, 283GB, 1xMI300X (192GB) no $2.49
+
Submit the run deepseek-r1-amd? [y/n]: y
Provisioning...
@@ -291,7 +290,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
```
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
is available at `https://gateway.