0.19.9
Metrics
Previously, dstack stored and displayed only metrics from the last hour, so the metrics of a finished run or job eventually disappeared.
Now, dstack keeps the last one-hour window of metrics for all finished runs.
AMD
On AMD, a wider range of ROCm/AMD SMI versions is now supported. Previously, for certain versions, metrics were not shown properly.
CLI
Container exit status
The CLI now displays the container exit status of each failed run or job.
To see this information, pass -v to dstack ps.
Server
Robust handling of networking issues
The dstack server sometimes cannot establish connections to running instances because of networking problems or because the instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server waits for a grace period of 2 minutes before considering jobs failed when instances are unreachable.
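The grace-period behavior can be pictured with a minimal sketch. This is not the dstack server's actual code: the function name, parameter names, and the constant name are hypothetical, and only the 2-minute value comes from the notes above.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 2-minute value is taken from the release notes; the name is hypothetical.
DISCONNECT_GRACE_PERIOD = timedelta(minutes=2)


def should_fail_job(
    disconnected_at: Optional[datetime],
    now: datetime,
    grace_period: timedelta = DISCONNECT_GRACE_PERIOD,
) -> bool:
    """Return True once an instance has been unreachable longer than the grace period.

    disconnected_at is the time the instance was first seen as unreachable,
    or None if it is currently reachable.
    """
    if disconnected_at is None:
        # The instance is reachable, so there is nothing to time out.
        return False
    return now - disconnected_at > grace_period
```

Under this sketch, a job whose instance reconnects within the grace period is never marked as failed, which is the practical effect the change describes.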
Environment variables
Two new environment variables are now available within runs:
- DSTACK_RUN_ID stores the UUID of the run. It's unique for a run, unlike DSTACK_RUN_NAME.
- DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
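As an illustration, a script running inside a run could read these variables to tag its logs or output paths. The sketch below is only one possible approach: the helper name dstack_run_context is hypothetical, and the values are None when the script runs outside dstack.

```python
import os
from typing import Dict, Mapping, Optional


def dstack_run_context(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Collect the dstack run identifiers from the environment.

    A value is None when its variable is not set, e.g. outside a dstack run.
    """
    return {
        "run_name": env.get("DSTACK_RUN_NAME"),
        "run_id": env.get("DSTACK_RUN_ID"),
        "job_id": env.get("DSTACK_JOB_ID"),
    }


if __name__ == "__main__":
    ctx = dstack_run_context()
    print(f"run={ctx['run_name']} run_id={ctx['run_id']} job_id={ctx['job_id']}")
```

Since DSTACK_JOB_ID differs for every replica, job, and retry attempt, it is a natural key for per-attempt artifacts, while DSTACK_RUN_ID groups everything that belongs to one run.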
What's changed
- [UX] Set the default gpu count to 1.. by @r4victor in #2624
- [UX] Introduce JOB_DISCONNECTED_RETRY_TIMEOUT by @r4victor in #2627
- [Feature] Pull and store process exit status from jobs by @un-def in #2615
- [Internal] Add dstackai/amd-smi image by @un-def in #2611
- [Runner] Improve GPU metrics collector by @un-def in #2612
- [Feature] Set DSTACK_RUN_ID and DSTACK_JOB_ID by @r4victor in #2622
- [Internal] Drop override message when overriding finished runs by @r4victor in #2623
- [Nebius] Support InfiniBand fabric for us-central1 by @jvstme in #2629
- [Feature] Keep the last metrics for finished jobs by @un-def in #2628
- [Nebius] Update Nebius default project detection by @jvstme in #2633
- [CUDO] Update the VM image by @r4victor in #2636
- [Docs]: Nebius InfiniBand clusters by @jvstme in #2634
- [Examples] Added examples/rccl-tests by @Bihan in #2613
- [Docs] Extracted Distributed training examples by @peterschmidt85 in #2614
- [Docs] fix YAML indent on trl example by @aaroniscode in #2617
- [Docs] Add example of including plugins into the dstack-server Docker image by @r4victor in #2620
- [Tests] Fix python-test by @peterschmidt85 in #2619
New contributors
- @aaroniscode made their first contribution in #2617
Full Changelog: 0.19.8...0.19.9
