0.18.44
GPU utilization policy
To avoid wasting resources, you can now specify a minimum required GPU utilization for a run. If any GPU's utilization stays below the threshold in every sample within the time window, the run is terminated.
type: task

utilization_policy:
  min_gpu_utilization: 30
  time_window: 30m

resources:
  gpu: nvidia:8:24GB

In this example, if any of the 8 GPUs has utilization below 30% in all samples during the last 30 minutes, the run will be terminated.
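The termination rule described above can be sketched in Python. This is only an illustration of the stated rule, not dstack's actual implementation: the function name `should_terminate`, the `Sample` type, and the choice to skip GPUs with no samples in the window are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical sample type: (timestamp, GPU utilization in percent).
Sample = Tuple[datetime, float]


def should_terminate(
    samples_by_gpu: Dict[int, List[Sample]],
    min_gpu_utilization: float,
    time_window: timedelta,
    now: datetime,
) -> bool:
    """Return True if any GPU was below the threshold in every sample of the window."""
    window_start = now - time_window
    for samples in samples_by_gpu.values():
        recent = [util for ts, util in samples if window_start <= ts <= now]
        # Assumption: a GPU with no samples in the window does not trigger termination.
        if not recent:
            continue
        if all(util < min_gpu_utilization for util in recent):
            return True
    return False


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0)
    busy = [(now - timedelta(minutes=m), 85.0) for m in range(0, 30, 5)]
    idle = [(now - timedelta(minutes=m), 5.0) for m in range(0, 30, 5)]
    # GPU 1 stays at 5% for the whole 30-minute window, so the run would be terminated.
    print(should_terminate({0: busy, 1: idle}, 30, timedelta(minutes=30), now))  # True
```

Note that a single sample at or above the threshold inside the window is enough to keep the run alive under this rule, which is why short bursts of activity prevent termination.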
DCGM metrics
dstack can now collect and export NVIDIA DCGM metrics from running jobs on supported backends (AWS, Azure, GCP, OCI) and SSH fleets.
Metrics are disabled by default. See the documentation for how to enable and scrape them.
RunPod Community Cloud
In addition to Secure Cloud, dstack will now use Community Cloud offers in the runpod backend. Community Cloud offers are usually cheaper and can be identified by a two-letter region code.
$ dstack apply -f .dstack.yml -b runpod
#  BACKEND  REGION    INSTANCE               SPOT  PRICE
1  runpod   CA        NVIDIA A100 80GB PCIe  yes   $0.6
2  runpod   CA-MTL-3  NVIDIA A100 80GB PCIe  yes   $0.82
It is possible to opt out of using Community Cloud in the backend settings.
Note
If you've previously configured the runpod backend via the dstack UI, your backend settings will likely contain a fixed set of regions, because previous dstack versions added it automatically. You can remove the regions property to allow all regions, including two-letter Community Cloud regions.
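As a rough illustration of the two-letter convention, the following Python sketch separates Community Cloud offers from Secure Cloud ones by region code. The helper `is_community_cloud_region` is hypothetical, and the assumption that a Community Cloud region is exactly two uppercase letters is inferred from the example output above rather than taken from a documented format.

```python
import re

# Assumption: Community Cloud regions are exactly two uppercase letters (e.g. "CA"),
# while Secure Cloud regions have a longer form such as "CA-MTL-3".
_TWO_LETTER_REGION = re.compile(r"[A-Z]{2}")


def is_community_cloud_region(region: str) -> bool:
    """Return True if the region code looks like a Community Cloud region."""
    return _TWO_LETTER_REGION.fullmatch(region) is not None


if __name__ == "__main__":
    # (region, price) pairs taken from the example output above.
    offers = [("CA", 0.6), ("CA-MTL-3", 0.82)]
    community = [region for region, _ in offers if is_community_cloud_region(region)]
    print(community)  # ['CA']
```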
What's Changed
- Show inactivity_duration in run plan in CLI by @jvstme in #2366
- Minor fixes noticed by aider by @r4victor in #2367
- Reexport DCGM metrics from instances by @un-def in #2364
- [Internal]: Update backend contributing docs by @jvstme in #2369
- Allow global admins to edit user emails via the UI by @olgenn in #2377
- Support sign-in via Microsoft EntraID for dstack Enterprise #251 by @olgenn in #2376
- Add utilization_policy by @un-def in #2375
- Support RunPod Community Cloud by @jvstme in #2378
- Add ORDER BY when selecting multiple rows with FOR UPDATE by @r4victor in #2379
- Allow global admins to edit user emails via the UI by @olgenn in #2381
- Support inactivity_duration in-place update by @jvstme in #2380
- Improve error message if pulling fails by @jvstme in #2382
- Fix utilization_policy in profiles by @un-def in #2385
- Set lower and upper limits of utilization_policy.time_window by @un-def in #2386
- Try more offers when starting a job by @jvstme in #2387
Full Changelog: 0.18.43...0.18.44