This commit implements provisioning of GCP A4
clusters with high-performance RoCE networking.
```shell
> dstack fleet
FLEET  INSTANCE  BACKEND         RESOURCES                                          PRICE    STATUS  CREATED
gpu    0         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)  $51.552  idle    21 mins ago
       1         gcp (us-west2)  cpu=224 mem=3968GB disk=100GB B200:180GB:8 (spot)  $51.552  idle    17 mins ago
```
To enable high-performance networking, users need
to create the
[appropriate networks](https://cloud.google.com/ai-hypercomputer/docs/create/create-vm#setup-network)
and configure them in the backend settings.
```yaml
projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0 # regular, 1 subnet
    extra_vpcs:
    - my-vpc-1 # regular, 1 subnet
    roce_vpcs:
    - my-vpc-mrdma # RoCE profile, 8 subnets
```
Then apply a fleet configuration.
```yaml
type: fleet
nodes: 2
placement: cluster
availability_zones: [us-west2-c]
backends: [gcp]
resources:
  gpu: 8:b200
```
Each instance in the cluster will then have 10
network interfaces:
- 1 regular interface in the main VPC (`default`
or the one configured in `vpc_name`).
- 1 regular interface in a VPC configured in
`extra_vpcs`.
- 8 RDMA interfaces in the VPC configured in
`roce_vpcs`.
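
The interface layout above can be sketched in a few lines of Python. This is only an illustration, not the dstack implementation: `Interface`, `plan_interfaces`, and the `rdma_per_roce_vpc` parameter are hypothetical names, and the count of 8 RDMA interfaces per RoCE VPC is taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Interface:
    vpc: str
    kind: str  # "regular" or "rdma"


def plan_interfaces(
    vpc_name: Optional[str],
    extra_vpcs: List[str],
    roce_vpcs: List[str],
    rdma_per_roce_vpc: int = 8,  # assumption: one RDMA interface per RoCE subnet
) -> List[Interface]:
    # The main VPC falls back to `default` when `vpc_name` is not set.
    interfaces = [Interface(vpc_name or "default", "regular")]
    # One regular interface per extra VPC.
    interfaces += [Interface(vpc, "regular") for vpc in extra_vpcs]
    # Several RDMA interfaces per RoCE VPC.
    for vpc in roce_vpcs:
        interfaces += [Interface(vpc, "rdma")] * rdma_per_roce_vpc
    return interfaces


plan = plan_interfaces("my-vpc-0", ["my-vpc-1"], ["my-vpc-mrdma"])
print(len(plan))  # 10
```

With the backend settings shown earlier, the plan contains 2 regular interfaces and 8 RDMA interfaces.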
Additionally, this commit optimizes the fetching
and caching of subnets, so that they are fetched
from the API only once, and not separately for
each item in `extra_vpcs`. For some instance
types, this reduces the number of API requests
from 9 to 1, which cuts about 16 seconds from each
offer provisioning attempt.
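
The idea behind the subnet optimization can be illustrated roughly as follows. This is a minimal sketch, assuming subnets can be listed for a whole region in one call and then grouped by VPC locally; `SubnetCache` and `fake_list_subnets` are hypothetical names, and the fake function stands in for the real GCP API call, which is not shown here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Subnet = Tuple[str, str]  # (VPC name, subnet name)


class SubnetCache:
    """Fetches all subnets of a region once and serves per-VPC lookups locally."""

    def __init__(self, list_subnets: Callable[[str], List[Subnet]]):
        self._list_subnets = list_subnets
        self._by_region: Dict[str, Dict[str, List[str]]] = {}

    def get(self, region: str, vpc: str) -> List[str]:
        if region not in self._by_region:
            # Single API request per region; results are grouped by VPC.
            grouped: Dict[str, List[str]] = defaultdict(list)
            for vpc_name, subnet in self._list_subnets(region):
                grouped[vpc_name].append(subnet)
            self._by_region[region] = dict(grouped)
        return self._by_region[region].get(vpc, [])


calls = []


def fake_list_subnets(region: str) -> List[Subnet]:
    # Stand-in for the real API call; records each invocation.
    calls.append(region)
    return [(f"my-vpc-{i}", f"subnet-{i}") for i in range(9)]


cache = SubnetCache(fake_list_subnets)
for i in range(9):
    cache.get("us-west2", f"my-vpc-{i}")
print(len(calls))  # 1
```

Nine per-VPC lookups result in a single call to the listing function, which mirrors the drop from 9 requests to 1 described above.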
un-def approved these changes on Oct 2, 2025
#3088