0.19.12
Clusters
Simplified use of MPI
startup_order and stop_criteria
New run configuration properties are introduced:
- startup_order: any/master-first/workers-first specifies the order in which master and worker jobs are started.
- stop_criteria: all-done/master-done specifies the criteria for when a multi-node run should be considered finished.
These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done, and dstack won't wait for workers to exit.
DSTACK_MPI_HOSTFILE
dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
Below is the updated NCCL tests example.
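A configuration along these lines combines startup_order, stop_criteria, and DSTACK_MPI_HOSTFILE. Treat it as a sketch rather than the exact published example: the image name, the path to the all_reduce_perf binary, the mpirun flags, and the resource values are assumptions to adapt to your setup.

```yaml
type: task
name: nccl-tests

nodes: 2

# Start workers before the master so they are ready when mpirun launches.
startup_order: workers-first
# Finish the run as soon as the master job is done.
stop_criteria: master-done

# Assumption: an image that ships Open MPI and prebuilt NCCL tests.
image: dstackai/efa
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      # Master node: launch the test across all nodes using the generated hostfile.
      # Assumption: the binary path depends on the image in use.
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      # Worker nodes: stay alive until the master finishes.
      sleep infinity
    fi

resources:
  gpu: nvidia:1..8
  shm_size: 16GB
```

Because stop_criteria is master-done, the workers' sleep infinity does not keep the run alive once mpirun exits on the master.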
CLI
We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status that makes it easy to understand why a run or job was terminated.
Examples
Distributed training
TRL
The new TRL example walks you through running distributed fine-tuning using TRL, Accelerate, and DeepSpeed.
Axolotl
The new Axolotl example walks you through running distributed fine-tuning using Axolotl with dstack.
What's changed
- [Feature] Update .gitignore logic to catch more cases by @colinjc in #2695
- [Bug] Increase upload_code client timeout by @r4victor in #2709
- [Bug] Fix missing apt-get update by @r4victor in #2710
- [Internal]: Update git hooks and package.json by @olgenn in #2706
- [Examples] Add distributed Axolotl and TRL example by @Bihan in #2703
- [Docs] Update dstack-proxy contributing guide by @jvstme in #2683
- [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in #2718
- [Feature] Implement startup_order and stop_criteria by @r4victor in #2714
- [Bug] Fix CLI exiting while master starting by @r4victor in #2720
- [Examples] Simplify NCCL tests example by @r4victor in #2723
- [Examples] Update TRL Single Node example to uv by @Bihan in #2715
- [Bug] Fix backward compatibility when creating fleets by @jvstme in #2727
- [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in #2716
- [Bug] Fix relative paths in dstack apply --repo by @jvstme in #2733
- [Internal]: Drop hardcoded regions from the backend template by @jvstme in #2734
- [Internal]: Update backend template to match ruff formatting by @jvstme in #2735
Full changelog: 0.19.11...0.19.12