Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 89 additions & 0 deletions docs/source/en/api/pipelines/cosmos3.md
Original file line number Diff line number Diff line change
Expand Up @@ -464,6 +464,95 @@ if result.action is not None:
</hfoption>
</hfoptions>

## Context parallelism

For long videos or high resolutions, a single forward pass can exceed the memory and latency budget of one GPU. Cosmos 3 supports **context parallelism (CP)** to shard the sequence dimension across multiple GPUs, splitting the attention computation so each device holds only a slice of the tokens.

Cosmos 3 supports **Ulysses** context parallelism (all-to-all sequence/head exchange). Ring attention is not supported.

Unlike most diffusers models, Cosmos 3 does **not** wire CP into the transformer or the declarative [`~ModelMixin.enable_parallelism`] path: its grouped-query attention, separate understanding/generation streams (the generation stream attends to both), and ragged per-stream lengths can't be expressed as a `_cp_plan`. Instead, the model exposes small no-op shard/gather seams, and the implementation lives in [`examples/cosmos3/cosmos_parallel.py`](https://github.com/huggingface/diffusers/blob/main/examples/cosmos3/cosmos_parallel.py) — a self-contained module you can read end to end and adapt. It offers two orthogonal, composable sharding axes:

| Helper | Shards | Use for |
|---|---|---|
| `enable_cosmos3_context_parallel(transformer, cp_mesh)` | sequence (CP / Ulysses) | latency on a model that fits one GPU (`Nano`) |
| `enable_cosmos3_tensor_parallel(transformer, tp_mesh)` | weights (TP) | fitting a model that doesn't fit one GPU (`Super`) |

Use either alone or both together on a 2-D `(tp, cp)` mesh (see [Fitting large models with tensor parallelism](#fitting-large-models-with-tensor-parallelism)).

Two requirements are specific to Cosmos 3:

- Use the `native` attention backend. Cosmos 3 uses grouped-query attention (GQA), and the native SDPA backend is the only one that accepts `enable_gqa` (cuDNN and flash reject it). The helpers expand the KV heads to the query-head count and call SDPA with `enable_gqa=False` so it still dispatches to the flash kernel (the math fallback would materialize the full `[S, S]` scores and OOM on long sequences).
- The CP (Ulysses) degree must divide the query-head count (32 for `Nano`, 64 for `Super`); for TP, the degree must divide the KV heads (8). The understanding (text) and generation (video/sound) streams are sharded independently along the sequence, and ragged lengths are zero-padded internally to a multiple of the world size.

### Run it

The full CLI [`examples/cosmos3/inference_cosmos3.py`](https://github.com/huggingface/diffusers/blob/main/examples/cosmos3/inference_cosmos3.py) reuses these helpers, so **any modality** (text-to-image/video, image-to-video, sound, action modes) runs multi-GPU via `--tp-degree` / `--cp-degree`. Launch with [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html); `--tp-degree * --cp-degree` must equal `--nproc_per_node`. Every rank produces the same output; rank 0 writes it.

```bash
# CP only — Nano (fits one GPU); CP degree must divide 32 query heads.
torchrun --nproc_per_node=4 examples/cosmos3/inference_cosmos3.py --model nano --cp-degree 4 --prompt "..."

# TP only — Super; TP degree must divide 64 query heads and 8 KV heads.
torchrun --nproc_per_node=4 examples/cosmos3/inference_cosmos3.py --model super --tp-degree 4 --prompt "..."

# TP + CP — Super, with sound (TP=2 x CP=2 across 4 GPUs).
torchrun --nproc_per_node=4 examples/cosmos3/inference_cosmos3.py \
--model super --tp-degree 2 --cp-degree 2 --enable-sound --prompt "..."
```

`Super`'s ~120 GB of weights do not fit on one 96 GB GPU, so it needs TP; `Nano` fits on a single GPU, so CP for it is a pure latency optimization. (Omit both flags to run single-GPU.)

### Fitting large models with tensor parallelism

CP shards *activations* but replicates every weight on every rank, so it does not reduce a model's weight footprint — a model that doesn't fit on one GPU still won't fit under CP alone. To shard the **weights**, `enable_cosmos3_tensor_parallel(transformer, tp_mesh)` applies Megatron-style tensor parallelism on a second, orthogonal mesh axis:

- The attention and MLP projections are column/row sharded across the TP group (`to_q/to_k/to_v` + `add_q/k/v` and the MLPs' `gate/up` are column-parallel; `to_out/to_add_out` and the MLPs' `down` are row-parallel with an all-reduce). Each rank ends up owning `query_heads / tp` query heads and `kv_heads / tp` KV heads.
- TP composes with CP on a 2-D `(tp, cp)` device mesh: TP splits heads/weights persistently, CP shards the sequence on top. The constraints are `tp` divides the KV heads (8), and `tp * cp` divides the query heads (32 for `Nano`, 64 for `Super`).
- Weights are loaded to CPU and sharded onto the GPUs layer by layer, so the full model is never materialized on a single device.

> [!TIP]
> TP issues an all-reduce on every attention and MLP block, so it is bandwidth-heavy. On hosts without NVLink it is the dominant cost; prefer the smallest TP degree that makes the weights fit and put the remaining GPUs into CP.

### Use it in your own pipeline

The CLI flags are convenient, but you can call the helpers directly. Build the device mesh, apply TP *before* the model lands on the GPUs, switch to the `native` backend, then enable CP — the rest of your pipeline code is unchanged:

```python
from torch.distributed.device_mesh import init_device_mesh

# Make the helper module importable.
import sys
sys.path.insert(0, "examples/cosmos3")
from cosmos_parallel import (
enable_cosmos3_context_parallel,
enable_cosmos3_flash_attention,
enable_cosmos3_tensor_parallel,
)

# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK. Pick tp_degree * cp_degree == world size.
mesh = init_device_mesh("cuda", (tp_degree, cp_degree), mesh_dim_names=("tp", "cp"))

# Load on CPU first; a TP-sharded model may not fit one GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
if tp_degree > 1:
enable_cosmos3_tensor_parallel(pipe.transformer, mesh["tp"]) # shard weights -> GPUs
pipe.to(f"cuda:{local_rank}") # move the replicated remainder
pipe.transformer.set_attention_backend("native")
if cp_degree > 1:
enable_cosmos3_context_parallel(pipe.transformer, mesh["cp"]) # shard the sequence
elif tp_degree > 1:
enable_cosmos3_flash_attention(pipe.transformer) # GQA-safe dense attention

# `pipe(...)` is called exactly as in the single-GPU workflows above.
```

For CP only (no weight sharding), use a 1-D mesh: `init_device_mesh("cuda", (world_size,), mesh_dim_names=("cp",))` and just `enable_cosmos3_context_parallel`.

> [!TIP]
> On some multi-GPU topologies the first NCCL all-to-all can hang. If a CP run stalls at the start of the first denoising step, set `NCCL_P2P_DISABLE=1` in the environment before launching `torchrun`.

CP and TP compose with all the workflows above (text-to-video, image-to-video, text-to-video with sound, and action-conditioned generation) and with both the `Nano` and `Super` checkpoints — only the pipeline construction and the parallelism setup lines change.

## Metadata templates

`tokenize_prompt` appends short metadata sentences inside the user message so the LLM sees the conditioning the model was trained with. The positive prompt gets sentences like *"The video is 7.9 seconds long and is of 24 FPS."* and *"This video is of 720x1280 resolution."*; the negative prompt gets the inverse (*"… is not …"*).
Expand Down
65 changes: 62 additions & 3 deletions examples/cosmos3/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,14 @@ The canonical reference for `Cosmos3OmniPipeline` lives in the diffusers docs:
examples there as the source of truth for application code — they cover text-to-image,
text-to-video, image-to-video, and text+sound modes.

This directory provides a small CLI wrapper (`inference_cosmos3.py`) that exercises the full
load → encode → denoise → decode path against either the Hub release or a local checkpoint
during development.
This directory provides two files:

- `inference_cosmos3.py` — the runnable CLI (text-to-image/video, image-to-video, sound, action
modes). Single-GPU by default; pass `--tp-degree` / `--cp-degree` and launch with `torchrun`
to run any modality multi-GPU (see [Multi-GPU inference](#multi-gpu-inference-context-parallelism)
below).
- `cosmos_parallel.py` — the importable multi-GPU helpers (context + tensor parallelism). No
`main`; the CLI imports from it. Read it to understand or adapt the sharding.

## Setup

Expand Down Expand Up @@ -168,3 +173,57 @@ Pick the tier that matches the native resolution of your conditioning input (`48
| `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1` and for action modes (which build a structured caption instead). |
| `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. Ignored for action modes. |
| `--output` | `.` | Directory to write `sample.jpg` or `sample.mp4`. |

## Multi-GPU inference (context parallelism)

Cosmos 3 can be sharded across GPUs on two orthogonal axes (implemented in `cosmos_parallel.py`):

- **Context parallelism (CP)** — `enable_cosmos3_context_parallel`. The *sequence* is sharded
across GPUs and attention runs with two Ulysses all-to-all collectives per layer, cutting
per-step latency for long videos / high resolutions. Weights are replicated, so this is for
models that already fit one GPU (`Nano`).
- **Tensor parallelism (TP)** — `enable_cosmos3_tensor_parallel`. The attention and MLP *weight*
matrices are sharded across GPUs (Megatron-style), so a checkpoint that doesn't fit one GPU
(`Super`, ~120 GB) loads. The sequence is not sharded.
- **TP + CP** — both at once on a 2-D `(tp, cp)` mesh: a large model *and* latency.

The model itself carries no parallelism logic — it exposes small no-op shard/gather seams, and
`cosmos_parallel.py` implements the entire path (collectives, GQA KV-head handling, ragged-length
padding, the dual-pathway attention, weight sharding) behind those two helpers. It is meant to be
read end to end and adapted.

The CLI imports these helpers, so you run **any modality** (text-to-image/video, image-to-video,
sound, action modes) multi-GPU by adding `--tp-degree` / `--cp-degree` and launching with
[torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html) — `--tp-degree * --cp-degree`
must equal `--nproc_per_node`:

```bash
# CP only (Nano): CP degree must divide the 32 query heads.
torchrun --nproc_per_node 4 examples/cosmos3/inference_cosmos3.py --model nano --cp-degree 4 --prompt "..."

# TP only (Super): TP degree must divide the 64 query heads and 8 KV heads.
torchrun --nproc_per_node 4 examples/cosmos3/inference_cosmos3.py --model super --tp-degree 4 --prompt "..."

# TP + CP (Super), 4 GPUs as 2 x 2, with sound:
torchrun --nproc_per_node 4 examples/cosmos3/inference_cosmos3.py \
--model super --tp-degree 2 --cp-degree 2 --enable-sound --prompt "A waterfall in a forest."
```

Notes:

- The helpers use the `native` attention backend (the only one that supports GQA's `enable_gqa`),
and expand the KV heads to the query-head count so SDPA picks the flash kernel — passing
`enable_gqa=True` forces the math kernel, which materializes the full `[S, S]` scores and OOMs
on long sequences.
- Only Ulysses is supported (not ring attention).
- The CP/Ulysses degree must divide the query heads (32 for `Nano`, 64 for `Super`). For TP,
`tp` must divide the KV heads (8), and `tp * cp` must divide the query heads.
- TP all-reduces on every block, so it's bandwidth-heavy — use the smallest TP degree that makes
the weights fit and put the remaining GPUs into CP.
- Generation size is set with the usual CLI flags (`--num-frames` / `--height` / `--width`), and
multi-GPU runs require a seed for reproducibility across ranks (the CLI sets one if you omit `--seed`).
- On some multi-GPU topologies the first NCCL all-to-all can hang; if a run stalls at the first
denoising step, set `NCCL_P2P_DISABLE=1` before launching.

See the [pipeline docs](../../docs/source/en/api/pipelines/cosmos3.md#context-parallelism) for how
to enable CP and TP from your own pipeline code.
Loading
Loading