
Commit 36cb5aa

Authored by Bihan Rana and peterschmidt85

Add distributed Axolotl and TRL example (#2703)

* Add distributed Axolotl and TRL example
* Resolve review comments
* [Docs] Renamed `Fine-tuning` to `Single-node training` for more clarity and consistency
* Remove uv from examples with NGC and remove the multi-node example from single-node training
* [Examples] Minor improvements regarding TRL and Axolotl
* Update Axolotl single-node training example

Co-authored-by: Bihan Rana <bihan@Bihans-MacBook-Pro.local>
Co-authored-by: peterschmidt85 <andrey.cheptsov@gmail.com>
1 parent 3bdf903 commit 36cb5aa


50 files changed: +741 −903 lines

docs/blog/posts/intel-gaudi.md

Lines changed: 2 additions & 2 deletions

@@ -98,7 +98,7 @@ model using [Optimum for Intel Gaudi :material-arrow-top-right-thin:{ .external
 and [DeepSpeed :material-arrow-top-right-thin:{ .external }](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide/DeepSpeed_User_Guide.html#deepspeed-user-guide){:target="_blank"} with
 the [`lvwerra/stack-exchange-paired` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/lvwerra/stack-exchange-paired){:target="_blank"} dataset:
 
-<div editor-title="examples/fine-tuning/trl/intel/.dstack.yml">
+<div editor-title="examples/single-node-training/trl/intel/.dstack.yml">
 
 ```yaml
 type: task
@@ -152,7 +152,7 @@ Submit the task using the [`dstack apply`](../../docs/reference/cli/dstack/apply
 <div class="termy">
 
 ```shell
-$ dstack apply -f examples/fine-tuning/trl/intel/.dstack.yml -R
+$ dstack apply -f examples/single-node-training/trl/intel/.dstack.yml -R
 ```
 
 </div>

docs/blog/posts/tpu-on-gcp.md

Lines changed: 3 additions & 3 deletions

@@ -158,7 +158,7 @@ Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU :material-arr
 and the [Abirate/english_quotes :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/Abirate/english_quotes){:target="_blank"}
 dataset.
 
-<div editor-title="examples/fine-tuning/optimum-tpu/llama31/train.dstack.yml">
+<div editor-title="examples/single-node-training/optimum-tpu/llama31/train.dstack.yml">
 
 ```yaml
 type: task
@@ -171,8 +171,8 @@ env:
 commands:
 - git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git
 - mkdir -p optimum-tpu/examples/custom/
-- cp examples/fine-tuning/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py
-- cp examples/fine-tuning/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml
+- cp examples/single-node-training/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py
+- cp examples/single-node-training/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml
 - cd optimum-tpu
 - pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html
 - pip install datasets evaluate

docs/docs/concepts/tasks.md

Lines changed: 3 additions & 3 deletions

@@ -10,7 +10,7 @@ The filename must end with `.dstack.yml` (e.g. `.dstack.yml` or `dev.dstack.yml`
 
 [//]: # (TODO: Make tabs - single machine & distributed tasks & web app)
 
-<div editor-title="examples/fine-tuning/axolotl/train.dstack.yml">
+<div editor-title="examples/single-node-training/axolotl/train.dstack.yml">
 
 ```yaml
 type: task
@@ -26,7 +26,7 @@ env:
 - WANDB_API_KEY
 # Commands of the task
 commands:
-- accelerate launch -m axolotl.cli.train examples/fine-tuning/axolotl/config.yaml
+- accelerate launch -m axolotl.cli.train examples/single-node-training/axolotl/config.yaml
 
 resources:
 gpu:
@@ -461,4 +461,4 @@ it does not block other runs with lower priority from scheduling.
 !!! info "What's next?"
 1. Read about [dev environments](dev-environments.md), [services](services.md), and [repos](repos.md)
 2. Learn how to manage [fleets](fleets.md)
-3. Check the [Axolotl](/examples/fine-tuning/axolotl) example
+3. Check the [Axolotl](/examples/single-node-training/axolotl) example

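For context, the Axolotl task touched by the rename above follows dstack's task configuration schema. A minimal sketch of such a single-node training task is shown below; the `type`, `env` variables, and the `accelerate launch` command appear in the diff, while the `name`, `image`, and `resources` values are illustrative assumptions rather than the exact contents of `train.dstack.yml`:

```yaml
type: task
# Assumed name; not taken from the actual example file
name: axolotl-train
# Assumed Docker image; the real example may pin a different one
image: axolotlai/axolotl:main-latest
env:
  - HF_TOKEN
  - WANDB_API_KEY
# Command of the task, as referenced in the diff above
commands:
  - accelerate launch -m axolotl.cli.train examples/single-node-training/axolotl/config.yaml
# Assumed GPU requirement for illustration
resources:
  gpu: 24GB
```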
docs/examples.md

Lines changed: 44 additions & 46 deletions

@@ -12,10 +12,21 @@ hide:
 }
 </style>
 
-## Fine-tuning
+## Single-node training
 
 <div class="tx-landing__highlights_grid">
-<a href="/examples/fine-tuning/axolotl"
+<a href="/examples/single-node-training/trl"
+class="feature-cell">
+<h3>
+TRL
+</h3>
+
+<p>
+Fine-tune Llama 3.1 8B on a custom dataset using TRL.
+</p>
+</a>
+
+<a href="/examples/single-node-training/axolotl"
 class="feature-cell">
 <h3>
 Axolotl
@@ -25,19 +36,47 @@ hide:
 Fine-tune Llama 4 on a custom dataset using Axolotl.
 </p>
 </a>
+</div>
 
-<a href="/examples/fine-tuning/trl"
-class="feature-cell">
+## Distributed training
+
+<div class="tx-landing__highlights_grid">
+<a href="/examples/distributed-training/trl"
+class="feature-cell sky">
 <h3>
 TRL
 </h3>
 
 <p>
-Fine-tune Llama 3.1 8B on a custom dataset using TRL.
+Fine-tune LLM on multiple nodes
+with TRL, Accelerate, and Deepspeed.
+</p>
+</a>
+<a href="/examples/distributed-training/axolotl"
+class="feature-cell sky">
+<h3>
+Axolotl
+</h3>
+
+<p>
+Fine-tune LLM on multiple nodes
+with Axolotl.
+</p>
+</a>
+<a href="/examples/distributed-training/ray-ragen"
+class="feature-cell sky">
+<h3>
+Ray+RAGEN
+</h3>
+
+<p>
+Fine-tune an agent on multiple nodes
+with RAGEN, verl, and Ray.
 </p>
 </a>
 </div>
 
+
 ## Clusters
 
 <div class="tx-landing__highlights_grid">
@@ -83,22 +122,6 @@ hide:
 </a>
 </div>
 
-## Distributed training
-
-<div class="tx-landing__highlights_grid">
-<a href="/examples/distributed-training/ray-ragen"
-class="feature-cell sky">
-<h3>
-Ray+RAGEN
-</h3>
-
-<p>
-Fine-tune an agent on multiple nodes
-with RAGEN, verl, and Ray.
-</p>
-</a>
-</div>
-
 ## Inference
 
 <div class="tx-landing__highlights_grid">
@@ -197,31 +220,6 @@ hide:
 </a>
 </div>
 
-## LLMs
-
-<div class="tx-landing__highlights_grid">
-<a href="/examples/llms/deepseek"
-class="feature-cell sky">
-<h3>
-Deepseek
-</h3>
-
-<p>
-Deploy and train Deepseek models
-</p>
-</a>
-<a href="/examples/llms/llama"
-class="feature-cell sky">
-<h3>
-Llama
-</h3>
-
-<p>
-Deploy Llama 4 models
-</p>
-</a>
-</div>
-
 ## Misc
 
 <div class="tx-landing__highlights_grid">
File renamed without changes.
File renamed without changes.
File renamed without changes.

docs/examples/single-node-training/trl/index.md

Whitespace-only changes.

docs/overrides/main.html

Lines changed: 2 additions & 3 deletions

@@ -117,12 +117,11 @@
 
 <div class="tx-footer__section">
 <div class="tx-footer__section-title">Examples</div>
-<a href="/examples#fine-tuning" class="tx-footer__section-link">Fine-tuning</a>
-<a href="/examples#clusters" class="tx-footer__section-link">Clusters</a>
+<a href="/examples#fine-tuning" class="tx-footer__section-link">Single-node training</a>
 <a href="/examples#distributed-training" class="tx-footer__section-link">Distributed training</a>
+<a href="/examples#clusters" class="tx-footer__section-link">Clusters</a>
 <a href="/examples#inference" class="tx-footer__section-link">Inference</a>
 <a href="/examples#accelerators" class="tx-footer__section-link">Accelerators</a>
-<a href="/examples#llms" class="tx-footer__section-link">LLMs</a>
 <!-- <a href="/examples#misc" class="tx-footer__section-link">Misc</a> -->
 </div>
 

examples/accelerators/amd/README.md

Lines changed: 6 additions & 6 deletions

@@ -114,7 +114,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by
 and the [`mlabonne/guanaco-llama2-1k` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k){:target="_blank"}
 dataset.
 
-<div editor-title="examples/fine-tuning/trl/amd/.dstack.yml">
+<div editor-title="examples/single-node-training/trl/amd/.dstack.yml">
 
 ```yaml
 type: task
@@ -140,7 +140,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by
 - pip install peft
 - pip install transformers datasets huggingface-hub scipy
 - cd ..
-- python examples/fine-tuning/trl/amd/train.py
+- python examples/single-node-training/trl/amd/train.py
 
 # Uncomment to leverage spot instances
 #spot_policy: auto
@@ -157,7 +157,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by
 and the [tatsu-lab/alpaca :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/tatsu-lab/alpaca){:target="_blank"}
 dataset.
 
-<div editor-title="examples/fine-tuning/axolotl/amd/.dstack.yml">
+<div editor-title="examples/single-node-training/axolotl/amd/.dstack.yml">
 
 ```yaml
 type: task
@@ -213,7 +213,7 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by
 
 > To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
 > You can find the tasks that build and upload the binaries
-> in [`examples/fine-tuning/axolotl/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/amd/){:target="_blank"}.
+> in [`examples/single-node-training/axolotl/amd/` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd/){:target="_blank"}.
 
 ## Running a configuration
 
@@ -238,8 +238,8 @@ $ dstack apply -f examples/inference/vllm/amd/.dstack.yml
 The source-code of this example can be found in
 [`examples/inference/tgi/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/tgi/amd){:target="_blank"},
 [`examples/inference/vllm/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/inference/vllm/amd){:target="_blank"},
-[`examples/fine-tuning/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/amd){:target="_blank"} and
-[`examples/fine-tuning/trl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/trl/amd){:target="_blank"}
+[`examples/single-node-training/axolotl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/axolotl/amd){:target="_blank"} and
+[`examples/single-node-training/trl/amd` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/single-node-training/trl/amd){:target="_blank"}
 
 ## What's next?
