
Commit 44fa7a4

Bihan Rana and peterschmidt85 authored
Add dstack example (#2)
* Add dstack example * Update dstack example * Updated `dstack` example * Minor Update --------- Co-authored-by: Bihan Rana <bihan@Bihans-MacBook-Pro.local> Co-authored-by: peterschmidt85 <andrey.cheptsov@gmail.com>
1 parent 5fe1839 commit 44fa7a4

1 file changed

Lines changed: 118 additions & 0 deletions

docs/start/multinode.rst
@@ -71,6 +71,124 @@ Slurm
-----
TBD

dstack
------
`dstackai/dstack <https://github.com/dstackai/dstack>`_ is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments without requiring Kubernetes or Slurm.

Prerequisite
~~~~~~~~~~~~
Once `dstack` is `installed <https://dstack.ai/docs/installation>`_, initialize the directory as a repo with ``dstack init``:

.. code-block:: bash

    mkdir myproject && cd myproject
    dstack init

**Create a fleet**

Before submitting distributed training jobs, create a `dstack` `fleet <https://dstack.ai/docs/concepts/fleets>`_.
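
The fleet defines the pool of instances that the cluster task below will run on. As a sketch (the fleet name here is an illustrative assumption, and the node count and GPU spec are sized to match the task that follows; see the fleets documentation for the full schema), a minimal fleet configuration could look like:

.. code-block:: yaml

    type: fleet
    name: ray-fleet

    # Two interconnected nodes, matching the task's `nodes: 2`
    nodes: 2
    placement: cluster

    resources:
      gpu: 80GB:8

Apply it with ``dstack apply -f fleet.dstack.yml`` and wait for the instances to be provisioned.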

Run a Ray cluster task
~~~~~~~~~~~~~~~~~~~~~~

Once the fleet is created, define a Ray cluster task, e.g. in ``ray-cluster.dstack.yml``:

.. code-block:: yaml

    type: task
    name: ray-verl-cluster

    nodes: 2

    env:
      - WANDB_API_KEY
      - PYTHONUNBUFFERED=1
      - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

    image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
    commands:
      - git clone https://github.com/volcengine/verl
      - cd verl
      - pip install --no-deps -e .
      - pip install hf_transfer hf_xet
      - |
        if [ $DSTACK_NODE_RANK = 0 ]; then
          python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
          python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
          ray start --head --port=6379
        else
          ray start --address=$DSTACK_MASTER_NODE_IP:6379
        fi

    # Expose the Ray dashboard port
    ports:
      - 8265

    resources:
      gpu: 80GB:8
      shm_size: 128GB

    # Save checkpoints on the instance
    volumes:
      - /checkpoints:/checkpoints

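The ``commands`` block starts a Ray head on the first node and joins the remaining nodes as workers, keyed off the ``DSTACK_NODE_RANK`` variable that `dstack` sets on every node (0 on the master node). A dry-run sketch of that branching (simulated ranks that only echo the command each node would run; the real task executes the actual ``ray start`` commands):

.. code-block:: shell

    # Simulate the per-node branching in the task's `commands` block.
    # dstack sets DSTACK_NODE_RANK and DSTACK_MASTER_NODE_IP at runtime;
    # here we loop over two fake ranks instead of starting Ray.
    for RANK in 0 1; do
      if [ "$RANK" = 0 ]; then
        CMD="ray start --head --port=6379"
      else
        CMD="ray start --address=\$DSTACK_MASTER_NODE_IP:6379"
      fi
      echo "rank $RANK: $CMD"
    done
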
Now, if you run this task via ``dstack apply``, it automatically forwards the Ray dashboard port to ``localhost:8265``:

.. code-block:: bash

    dstack apply -f ray-cluster.dstack.yml

As long as ``dstack apply`` stays attached, you can use ``localhost:8265`` to submit Ray jobs for execution.

Submit Ray jobs
~~~~~~~~~~~~~~~

Before you can submit Ray jobs, make sure ``ray`` is installed locally:

.. code-block:: shell

    pip install ray

Now you can submit the training job to the Ray cluster available at ``localhost:8265``:

.. code-block:: shell

    export RAY_ADDRESS=http://localhost:8265
    ray job submit \
        -- python3 -m verl.trainer.main_ppo \
        data.train_files=/root/data/gsm8k/train.parquet \
        data.val_files=/root/data/gsm8k/test.parquet \
        data.train_batch_size=256 \
        data.max_prompt_length=512 \
        data.max_response_length=256 \
        actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.actor.ppo_mini_batch_size=64 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
        critic.optim.lr=1e-5 \
        critic.model.path=Qwen/Qwen2.5-7B-Instruct \
        critic.ppo_micro_batch_size_per_gpu=4 \
        algorithm.kl_ctrl.kl_coef=0.001 \
        trainer.project_name=ppo_training \
        trainer.experiment_name=qwen-2.5-7B \
        trainer.val_before_train=False \
        trainer.default_hdfs_dir=null \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=2 \
        trainer.default_local_dir=/checkpoints \
        trainer.save_freq=10 \
        trainer.test_freq=10 \
        trainer.total_epochs=15 \
        trainer.resume_mode=disable 2>&1 | tee verl_demo.log

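The parallelism flags above can be sanity-checked with quick arithmetic (a sketch only; the actual batch splitting happens inside verl's trainer): ``trainer.nnodes=2`` × ``trainer.n_gpus_per_node=8`` gives 16 GPUs, and each global batch of 256 is processed as mini-batches of 64, so each GPU handles 4 rows per mini-batch, matching ``ppo_micro_batch_size_per_gpu=4``.

.. code-block:: shell

    # Illustrative arithmetic for the flags above, not verl's internal logic.
    GPUS=$((8 * 2))          # trainer.n_gpus_per_node * trainer.nnodes
    MINI=$((256 / 64))       # data.train_batch_size / ppo_mini_batch_size
    PER_GPU=$((64 / GPUS))   # rows of each mini-batch handled per GPU
    echo "gpus=$GPUS mini_batches_per_step=$MINI rows_per_gpu=$PER_GPU"

This prints ``gpus=16 mini_batches_per_step=4 rows_per_gpu=4``.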
For more details on how `dstack` works, check out its `documentation <https://dstack.ai/docs>`_.

How to debug?
---------------------
