0.18.44

@un-def un-def released this 05 Mar 13:50
· 1011 commits to master since this release
e125dff

GPU utilization policy

To avoid wasting resources, you can now specify a minimum required GPU utilization for a run. If any GPU stays below the threshold in every sample within the time window, the run is terminated.

type: task

utilization_policy:
  min_gpu_utilization: 30
  time_window: 30m

resources:
  gpu: nvidia:8:24GB

In this example, if any of the 8 GPUs has utilization below 30% in all samples during the last 30 minutes, the run will be terminated.
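The termination rule can be illustrated with a small Python sketch. This is not dstack's implementation: the function name `should_terminate`, the sample layout (a dict mapping a GPU index to `(timestamp, utilization)` pairs), and the choice to skip GPUs that have no samples in the window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def should_terminate(samples, now, min_gpu_utilization, time_window):
    """Return True if any GPU was below the threshold in ALL samples in the window.

    samples: dict mapping GPU index -> list of (timestamp, utilization_percent).
    This layout is hypothetical and chosen only for the example.
    """
    window_start = now - time_window
    for gpu_samples in samples.values():
        # Keep only the samples that fall inside the time window.
        recent = [util for ts, util in gpu_samples if ts >= window_start]
        # Assumption: a GPU with no samples in the window is not treated as idle.
        if recent and all(util < min_gpu_utilization for util in recent):
            return True
    return False


now = datetime(2025, 3, 5, 14, 0)
window = timedelta(minutes=30)

# GPU 0 is busy; GPU 1 is below 30% in every sample of the last 30 minutes.
idle_case = {
    0: [(now - timedelta(minutes=m), 95) for m in (25, 15, 5)],
    1: [(now - timedelta(minutes=m), u) for m, u in ((25, 10), (15, 12), (5, 8))],
}
# Same as above, but GPU 1 has one sample at 45%, so it is not idle.
busy_case = {
    0: idle_case[0],
    1: [(now - timedelta(minutes=m), u) for m, u in ((25, 10), (15, 45), (5, 8))],
}

print(should_terminate(idle_case, now, 30, window))  # True
print(should_terminate(busy_case, now, 30, window))  # False
```

Note that a single sample at or above the threshold inside the window is enough to keep the run alive, and that one idle GPU out of several is enough to terminate it.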

DCGM metrics

dstack can now collect and export NVIDIA DCGM metrics from running jobs on supported backends (AWS, Azure, GCP, OCI) and SSH fleets.

Metrics are disabled by default. See the documentation for how to enable and scrape them.

RunPod Community Cloud

In addition to Secure Cloud, dstack will now use Community Cloud offers in the runpod backend. Community Cloud offers are usually cheaper and can be identified by a two-letter region code.

$ dstack apply -f .dstack.yml -b runpod
 #  BACKEND  REGION    INSTANCE               SPOT  PRICE
 1  runpod   CA        NVIDIA A100 80GB PCIe  yes   $0.6
 2  runpod   CA-MTL-3  NVIDIA A100 80GB PCIe  yes   $0.82
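
The two-letter rule can be expressed as a short Python check, for example when filtering offers in a script. This is a heuristic sketch based only on the description above; the helper name `is_community_cloud_region` is hypothetical and is not part of dstack or the RunPod API.

```python
def is_community_cloud_region(region: str) -> bool:
    """Heuristic: Community Cloud regions are plain two-letter codes such as "CA".

    Secure Cloud regions use longer identifiers such as "CA-MTL-3".
    """
    code = region.strip()
    return len(code) == 2 and code.isalpha()


for region in ("CA", "CA-MTL-3"):
    kind = "Community Cloud" if is_community_cloud_region(region) else "Secure Cloud"
    print(f"{region}: {kind}")
```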

You can opt out of Community Cloud in the backend settings.

Note

If you previously configured the runpod backend via the dstack UI, your backend settings will likely contain a fixed set of regions, because previous dstack versions added them automatically. You can remove the regions property to allow all regions, including two-letter Community Cloud regions.

What's Changed

  • Show inactivity_duration in run plan in CLI by @jvstme in #2366
  • Minor fixes noticed by aider by @r4victor in #2367
  • Reexport DCGM metrics from instances by @un-def in #2364
  • [Internal]: Update backend contributing docs by @jvstme in #2369
  • Allow global admins to edit user emails via the UI by @olgenn in #2377
  • Support sign-in via Microsoft EntraID for dstack Enterprise #251 by @olgenn in #2376
  • Add utilization_policy by @un-def in #2375
  • Support RunPod Community Cloud by @jvstme in #2378
  • Add ORDER BY when selecting multiple rows with FOR UPDATE by @r4victor in #2379
  • Allow global admins to edit user emails via the UI by @olgenn in #2381
  • Support inactivity_duration in-place update by @jvstme in #2380
  • Improve error message if pulling fails by @jvstme in #2382
  • Fix utilization_policy in profiles by @un-def in #2385
  • Set lower and upper limits of utilization_policy.time_window by @un-def in #2386
  • Try more offers when starting a job by @jvstme in #2387

Full Changelog: 0.18.43...0.18.44