4 changes: 2 additions & 2 deletions docs/docs/concepts/fleets.md
Contributor

In Fleets, I would probably make it more visible that the user can set either a fixed number of nodes or a range. Currently we only show a fixed number. A range is going to be an even more popular choice. I would show both and explicitly explain when one or the other should be used.

Let me know if you want me to update it.

Contributor

And the range example should also mention `idle_duration` explicitly.

Collaborator Author

Feel free to push a commit

Original file line number Diff line number Diff line change
@@ -214,10 +214,10 @@ blocks: 4
#### Idle duration

By default, fleet instances stay `idle` for 3 days and can be reused within that time.
Comment thread
r4victor marked this conversation as resolved.
If the fleet is not reused within this period, it is automatically terminated.
If an instance is not reused within this period, it is automatically terminated.

To change the default idle duration, set
[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the run configuration (e.g., `0s`, `1m`, or `off` for
[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the fleet configuration (e.g., `0s`, `1m`, or `off` for
unlimited).
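
As a minimal sketch of the settings discussed in this section, the fleet configuration below combines a node range with an explicit `idle_duration`. The fleet name and the specific values (`0..2`, `1h`) are hypothetical and chosen only for illustration; the fleet reference remains the authority on supported values.

```yaml
type: fleet
# Hypothetical fleet name
name: my-fleet

# A range rather than a fixed number: start with zero instances
# and allow up to two to be provisioned on demand (illustrative values)
nodes: 0..2

# Terminate an instance once it has been idle for one hour
# (the default described above is 3 days; `off` means unlimited)
idle_duration: 1h
```

A fixed value such as `nodes: 2` would instead keep the fleet at exactly that size, which tends to suit clusters; a range is likely the better fit when instances should come and go with demand.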

<div editor-title="examples/misc/fleets/.dstack.yml">
15 changes: 5 additions & 10 deletions docs/docs/concepts/snippets/manage-fleets.ext
@@ -1,10 +1,10 @@
### Creation policy

By default, when you run `dstack apply` with a dev environment, task, or service,
if no `idle` instances from the available fleets meet the requirements, `dstack` creates a new fleet
if no `idle` instances from the available fleets meet the requirements, `dstack` provisions a new instance
using configured backends.

To ensure `dstack apply` doesn't create a new fleet but reuses an existing one,
To ensure `dstack apply` doesn't provision a new instance but reuses an existing one,
pass `-R` (or `--reuse`) to `dstack apply`.

<div class="termy">
@@ -19,12 +19,7 @@ Or, set [`creation_policy`](../reference/dstack.yml/dev-environment.md#creation_

### Idle duration

If a fleet is created automatically, it stays `idle` for 5 minutes by default and can be reused within that time.
If the fleet is not reused within this period, it is automatically terminated.
If a run provisions a new instance, the instance stays `idle` for 5 minutes by default and can be reused within that time.
Comment thread
r4victor marked this conversation as resolved.
If the instance is not reused within this period, it is automatically terminated.
To change the default idle duration, set
[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the run configuration (e.g., `0s`, `1m`, or `off` for
unlimited).

!!! info "Fleets"
For greater control over fleet provisioning, it is recommended to create
[fleets](fleets.md) explicitly.
[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the run configuration (e.g., `0s`, `1m`, or `off` for unlimited).
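
For illustration, a run configuration that overrides the idle duration might look like the sketch below. The run name and the command are hypothetical; only the `idle_duration` property and its example values come from the text above.

```yaml
type: task
# Hypothetical run name
name: train

commands:
  # Hypothetical command
  - python train.py

# If this run provisions a new instance, terminate it after
# one minute of being idle instead of the default 5 minutes
idle_duration: 1m
```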
8 changes: 2 additions & 6 deletions docs/docs/concepts/tasks.md
@@ -170,12 +170,8 @@ Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
For convenience, `~/.ssh/config` is preconfigured with these options, so a simple `ssh <node_ip>` is enough.
For a list of node IPs, check the `DSTACK_NODES_IPS` environment variable.

!!! info "Fleets"
Distributed tasks can only run on fleets with
[cluster placement](fleets.md#cloud-placement).
While `dstack` can provision such fleets automatically, it is
recommended to create them via a fleet configuration
to ensure the highest level of inter-node connectivity.
!!! info "Cluster fleets"
To run distributed tasks, you need to create a fleet with [`placement: cluster`](fleets.md#cloud-placement).

> See the [Clusters](../guides/clusters.md) guide for more details on how to use `dstack` on clusters.
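
As a rough sketch of how a distributed task relies on such a fleet, the configuration below requests two nodes and passes the per-node environment variables to `torchrun`. The run name, the `train.py` script, the port, and the GPU spec are assumptions made for illustration and are not taken from this page.

```yaml
type: task
# Hypothetical run name
name: train-distrib

# Both nodes are provisioned in a fleet created with `placement: cluster`
nodes: 2

commands:
  # DSTACK_NODE_RANK and DSTACK_MASTER_NODE_IP are set by dstack on every node;
  # the script name and port are placeholders
  - torchrun --nnodes=2 --nproc_per_node=8 --node_rank=$DSTACK_NODE_RANK --master_addr=$DSTACK_MASTER_NODE_IP --master_port=12345 train.py

resources:
  # Illustrative GPU requirement; it should match what the fleet offers
  gpu: 80GB:8
```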

18 changes: 7 additions & 11 deletions docs/docs/guides/protips.md
@@ -190,11 +190,9 @@ See more Docker examples [here](https://github.com/dstackai/dstack/tree/master/e
### Creation policy

By default, when you run `dstack apply` with a dev environment, task, or service,
`dstack` reuses `idle` instances from an existing [fleet](../concepts/fleets.md).
If no `idle` instances match the requirements, `dstack` automatically creates a new fleet
using configured backends.
if no `idle` instances from the available fleets meet the requirements, `dstack` provisions a new instance using configured backends.

To ensure `dstack apply` doesn't create a new fleet but reuses an existing one,
To ensure `dstack apply` doesn't provision a new instance but reuses an existing one,
pass `-R` (or `--reuse`) to `dstack apply`.

<div class="termy">
@@ -205,16 +203,14 @@ $ dstack apply -R -f examples/.dstack.yml

</div>

Or, set [`creation_policy`](../reference/dstack.yml/dev-environment.md#creation_policy) to `reuse` in the run configuration.
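
A minimal sketch of the configuration-based alternative to `-R`, assuming a dev environment; the run name and IDE choice are hypothetical:

```yaml
type: dev-environment
# Hypothetical run name
name: vscode

ide: vscode

# Only reuse an existing idle instance; never provision a new one
creation_policy: reuse
```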

### Idle duration

If a fleet is created automatically, it stays `idle` for 5 minutes by default and can be reused within that time.
If the fleet is not reused within this period, it is automatically terminated.
If a run provisions a new instance, the instance stays `idle` for 5 minutes by default and can be reused within that time.
Contributor

Haven't we dropped the defaults for idle_duration?

Collaborator Author

Dropped max_duration default long ago but not idle_duration

This comment was marked as resolved.

Contributor

Ahh, it's something else. Should I create a new issue about dropping the `idle_duration` defaults? I think it's important.

Collaborator Author

I think with `idle_duration: off` by default, there's a much higher chance that we forget to set it and pay $$$ for idle instances. Otherwise, I support getting rid of random defaults.

If the instance is not reused within this period, it is automatically terminated.
To change the default idle duration, set
[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the run configuration (e.g., `0s`, `1m`, or `off` for
unlimited).

> For greater control over fleet provisioning, configuration, and lifecycle management, it is recommended to use
> [fleets](../concepts/fleets.md) directly.
[`idle_duration`](../reference/dstack.yml/fleet.md#idle_duration) in the run configuration (e.g., `0s`, `1m`, or `off` for unlimited).

## Volumes

42 changes: 41 additions & 1 deletion docs/docs/quickstart.md
@@ -1,6 +1,6 @@
# Quickstart

> Before using `dstack`, ensure you've [installed](installation/index.md) the server, or signed up for [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}.
> Before using `dstack`, ensure you've [installed](installation/index.md) the server, or signed up for [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}

## Set up a directory

@@ -14,6 +14,46 @@ $ mkdir quickstart && cd quickstart

</div>

## Create a fleet

Before submitting runs, you need to create a fleet where new instances will be provisioned.

### Define a configuration
Comment thread
r4victor marked this conversation as resolved.
Outdated

Create the following fleet configuration inside your project folder:

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
name: default-fleet
Comment thread
r4victor marked this conversation as resolved.
Outdated
nodes: 0..
Contributor

Perhaps, I'd also add resources to show that it's possible to limit what GPU types are allowed?

Contributor

See #3249

Collaborator Author

maybe once #3249 is fixed then?

```

</div>

### Apply the configuration
Comment thread
r4victor marked this conversation as resolved.
Outdated

Apply the configuration via [`dstack apply`](reference/cli/dstack/apply.md):

<div class="termy">

```shell
$ dstack apply -f fleet.dstack.yml

# BACKEND REGION RESOURCES SPOT PRICE
1 gcp us-west4 2xCPU, 8GB, 100GB (disk) yes $0.010052
2 azure westeurope 2xCPU, 8GB, 100GB (disk) yes $0.0132
3 gcp europe-central2 2xCPU, 8GB, 100GB (disk) yes $0.013248

Fleet default-fleet does not exist yet.
Create the fleet? [y/n]: y
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
default-fleet - - - - - 10:36
```

</div>

## Submit your first run

`dstack` supports three types of run configurations.
8 changes: 8 additions & 0 deletions examples/clusters/efa/fleet.dstack.yml
@@ -0,0 +1,8 @@
type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

resources:
gpu: H100:8
22 changes: 22 additions & 0 deletions examples/clusters/nccl-tests/README.md
@@ -2,6 +2,28 @@

This example shows how to run distributed [NCCL tests :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/nccl-tests){:target="_blank"} with MPI using `dstack`.

## Create fleet

Before running NCCL tests, make sure to create a fleet with `placement: cluster`. Here's a fleet configuration suitable for this example:

<div editor-title="examples/clusters/nccl-tests/fleet.dstack.yml">

```yaml
type: fleet
name: cluster-fleet

nodes: 2
placement: cluster

resources:
gpu: nvidia:1..8
shm_size: 16GB
```

</div>

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.

## Running as a task

Here's an example of a task that runs AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).
9 changes: 9 additions & 0 deletions examples/clusters/nccl-tests/fleet.dstack.yml
@@ -0,0 +1,9 @@
type: fleet
name: cluster-fleet

nodes: 2
placement: cluster

resources:
gpu: nvidia:1..8
shm_size: 16GB
22 changes: 22 additions & 0 deletions examples/clusters/rccl-tests/README.md
@@ -2,6 +2,28 @@

This example shows how to run distributed [RCCL tests :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rccl-tests){:target="_blank"} with MPI using `dstack`.

## Create fleet

Before running RCCL tests, make sure to create a fleet with `placement: cluster`. Here's a fleet configuration suitable for this example:

<div editor-title="examples/clusters/rccl-tests/fleet.dstack.yml">

```yaml
type: fleet
name: cluster-fleet

nodes: 2
placement: cluster

resources:
gpu: MI300X:8
```

</div>

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.


## Running as a task

Here's an example of a task that runs AllReduce test on 2 nodes, each with 8 `Mi300x` GPUs (16 processes in total).
8 changes: 8 additions & 0 deletions examples/clusters/rccl-tests/fleet.dstack.yml
@@ -0,0 +1,8 @@
type: fleet
name: cluster-fleet

nodes: 2
placement: cluster

resources:
gpu: MI300X:8
22 changes: 19 additions & 3 deletions examples/distributed-training/axolotl/README.md
@@ -13,11 +13,27 @@ This example walks you through how to run distributed fine-tune using [Axolotl :
```
</div>

## Create a fleet
## Create fleet

Before submitting distributed training runs, make sure to create a fleet with a `placement` set to `cluster`.
Before submitting distributed training runs, make sure to create a fleet with `placement: cluster`. Here's a fleet configuration suitable for this example:

> For more detials on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
<div editor-title="examples/distributed-training/axolotl/fleet.dstack.yml">

```yaml
type: fleet
name: axolotl-fleet

nodes: 2
placement: cluster

resources:
gpu: 80GB:8
shm_size: 128GB
```

</div>

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.

## Define a configuration

9 changes: 9 additions & 0 deletions examples/distributed-training/axolotl/fleet.dstack.yml
@@ -0,0 +1,9 @@
type: fleet
name: axolotl-fleet

nodes: 2
placement: cluster

resources:
gpu: 80GB:8
shm_size: 128GB
24 changes: 20 additions & 4 deletions examples/distributed-training/ray-ragen/README.md
@@ -1,15 +1,31 @@
# Ray + RAGEN

This example shows how to use `dstack` and [RAGEN :material-arrow-top-right-thin:{ .external }](https://github.com/RAGEN-AI/RAGEN){:target="_blank"}
to fine-tune an agent on mulitiple nodes.
to fine-tune an agent on multiple nodes.

Under the hood `RAGEN` uses [verl :material-arrow-top-right-thin:{ .external }](https://github.com/volcengine/verl){:target="_blank"} for Reinforcement Learning and [Ray :material-arrow-top-right-thin:{ .external }](https://docs.ray.io/en/latest/){:target="_blank"} for ditributed training.
Under the hood `RAGEN` uses [verl :material-arrow-top-right-thin:{ .external }](https://github.com/volcengine/verl){:target="_blank"} for Reinforcement Learning and [Ray :material-arrow-top-right-thin:{ .external }](https://docs.ray.io/en/latest/){:target="_blank"} for distributed training.

## Create fleet

Before submitted disributed training runs, make sure to create a fleet with a `placement` set to `cluster`.
Before submitting distributed training runs, make sure to create a fleet with `placement: cluster`. Here's a fleet configuration suitable for this example:

> For more detials on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
<div editor-title="examples/distributed-training/ray-ragen/fleet.dstack.yml">

```yaml
type: fleet
name: ray-ragen-cluster-fleet

nodes: 2
placement: cluster

resources:
gpu: 80GB:8
shm_size: 128GB
```

</div>

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.

## Run a Ray cluster

9 changes: 9 additions & 0 deletions examples/distributed-training/ray-ragen/fleet.dstack.yml
@@ -0,0 +1,9 @@
type: fleet
name: ray-ragen-cluster-fleet

nodes: 2
placement: cluster

resources:
gpu: 80GB:8
shm_size: 128GB
22 changes: 19 additions & 3 deletions examples/distributed-training/trl/README.md
@@ -15,11 +15,27 @@ This example walks you through how to run distributed fine-tune using [TRL :mate

## Create fleet

Before submitting distributed training runs, make sure to create a fleet with a `placement` set to `cluster`.
Before submitting distributed training runs, make sure to create a fleet with `placement: cluster`. Here's a fleet configuration suitable for this example:
Contributor

In single-node-training, you don't add a Create fleet section. Why?

Contributor

Also, I'd probably make this section collapsed by default, as it repeats everywhere.

Collaborator Author

I assumed users may already have a suitable fleet, since a cluster is not required. For the same reason, the Tasks, Services, and Dev environments pages don't have a Create fleet section. But we can add a Create fleet section everywhere if you'd like that.


> For more detials on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.
<div editor-title="examples/distributed-training/trl/fleet.dstack.yml">

## Define a configurtation
```yaml
type: fleet
name: trl-train-fleet

nodes: 2
placement: cluster

resources:
gpu: 80GB:8
shm_size: 128GB
```

</div>

> For more details on how to use clusters with `dstack`, check the [Clusters](https://dstack.ai/docs/guides/clusters) guide.

## Define a configuration

Once the fleet is created, define a distributed task configuration. Here's an example of such a task.

9 changes: 9 additions & 0 deletions examples/distributed-training/trl/fleet.dstack.yml
@@ -0,0 +1,9 @@
type: fleet
name: trl-train-fleet

nodes: 2
placement: cluster

resources:
gpu: 80GB:8
shm_size: 128GB
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -220,10 +220,10 @@ nav:
- Quickstart: docs/quickstart.md
- Concepts:
- Backends: docs/concepts/backends.md
- Fleets: docs/concepts/fleets.md
- Dev environments: docs/concepts/dev-environments.md
- Tasks: docs/concepts/tasks.md
- Services: docs/concepts/services.md
- Fleets: docs/concepts/fleets.md
- Volumes: docs/concepts/volumes.md
- Secrets: docs/concepts/secrets.md
- Projects: docs/concepts/projects.md