huggingface · atharvajoshi10 · Jun 9, 2026 · Jun 23, 2026
diff --git a/docs/source/en/api/pipelines/cosmos3.md b/docs/source/en/api/pipelines/cosmos3.md
@@ -464,6 +464,95 @@ if result.action is not None:
 </hfoption>
 </hfoptions>
 
+## Context parallelism
+
+For long videos or high resolutions, a single forward pass can exceed the memory and latency budget of one GPU. Cosmos 3 supports **context parallelism (CP)** to shard the sequence dimension across multiple GPUs, splitting the attention computation so each device holds only a slice of the tokens.
+
+Cosmos 3 supports **Ulysses** context parallelism (all-to-all sequence/head exchange). Ring attention is not supported.
+
+Unlike most diffusers models, Cosmos 3 does **not** wire CP into the transformer or the declarative [`~ModelMixin.enable_parallelism`] path: its grouped-query attention, separate understanding/generation streams (the generation stream attends to both), and ragged per-stream lengths can't be expressed as a `_cp_plan`. Instead, the model exposes small no-op shard/gather seams, and the implementation lives in [`examples/cosmos3/cosmos_parallel.py`](https://github.com/huggingface/diffusers/blob/main/examples/cosmos3/cosmos_parallel.py) — a self-contained module you can read end to end and adapt. It offers two orthogonal, composable sharding axes:
+
+| Helper | Shards | Use for |
+|---|---|---|
+| `enable_cosmos3_context_parallel(transformer, cp_mesh)` | sequence (CP / Ulysses) | latency on a model that fits one GPU (`Nano`) |
+| `enable_cosmos3_tensor_parallel(transformer, tp_mesh)` | weights (TP) | fitting a model that doesn't fit one GPU (`Super`) |
+
+Use either alone or both together on a 2-D `(tp, cp)` mesh (see [Fitting large models with tensor parallelism](#fitting-large-models-with-tensor-parallelism)).
+
+Two requirements are specific to Cosmos 3:
+
+- Use the `native` attention backend. Cosmos 3 uses grouped-query attention (GQA), and the native SDPA backend is the only one that accepts `enable_gqa` (cuDNN and flash reject it). The helpers expand the KV heads to the query-head count and call SDPA with `enable_gqa=False` so it still dispatches to the flash kernel (the math fallback would materialize the full `[S, S]` scores and OOM on long sequences).
+- The CP (Ulysses) degree must divide the query-head count (32 for `Nano`, 64 for `Super`); for TP, the degree must divide the KV heads (8). The understanding (text) and generation (video/sound) streams are sharded independently along the sequence, and ragged lengths are zero-padded internally to a multiple of the world size.
+
+### Run it
+
+The full CLI [`examples/cosmos3/inference_cosmos3.py`](https://github.com/huggingface/diffusers/blob/main/examples/cosmos3/inference_cosmos3.py) reuses these helpers, so **any modality** (text-to-image/video, image-to-video, sound, action modes) runs multi-GPU via `--tp-degree` / `--cp-degree`. Launch with [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html); `--tp-degree * --cp-degree` must equal `--nproc_per_node`. Every rank produces the same output; rank 0 writes it.
+
+```bash
+# CP only — Nano (fits one GPU); CP degree must divide 32 query heads.
+torchrun --nproc_per_node=4 examples/cosmos3/inference_cosmos3.py --model nano --cp-degree 4 --prompt "..."
+
+# TP only — Super; TP degree must divide 64 query heads and 8 KV heads.
+torchrun --nproc_per_node=4 examples/cosmos3/inference_cosmos3.py --model super --tp-degree 4 --prompt "..."
+
+# TP + CP — Super, with sound (TP=2 x CP=2 across 4 GPUs).
+torchrun --nproc_per_node=4 examples/cosmos3/inference_cosmos3.py \
+    --model super --tp-degree 2 --cp-degree 2 --enable-sound --prompt "..."
+```
+
+`Super`'s ~120 GB of weights do not fit on one 96 GB GPU, so it needs TP; `Nano` fits on a single GPU, so CP for it is a pure latency optimization. (Omit both flags to run single-GPU.)
+
+### Fitting large models with tensor parallelism
+
+CP shards *activations* but replicates every weight on every rank, so it does not reduce a model's weight footprint — a model that doesn't fit on one GPU still won't fit under CP alone. To shard the **weights**, `enable_cosmos3_tensor_parallel(transformer, tp_mesh)` applies Megatron-style tensor parallelism on a second, orthogonal mesh axis:
+
+- The attention and MLP projections are column/row sharded across the TP group (`to_q/to_k/to_v` + `add_q/k/v` and the MLPs' `gate/up` are column-parallel; `to_out/to_add_out` and the MLPs' `down` are row-parallel with an all-reduce). Each rank ends up owning `query_heads / tp` query heads and `kv_heads / tp` KV heads.
+- TP composes with CP on a 2-D `(tp, cp)` device mesh: TP splits heads/weights persistently, CP shards the sequence on top. The constraints are `tp` divides the KV heads (8), and `tp * cp` divides the query heads (32 for `Nano`, 64 for `Super`).
+- Weights are loaded to CPU and sharded onto the GPUs layer by layer, so the full model is never materialized on a single device.
+
+> [!TIP]
+> TP issues an all-reduce on every attention and MLP block, so it is bandwidth-heavy. On hosts without NVLink it is the dominant cost; prefer the smallest TP degree that makes the weights fit and put the remaining GPUs into CP.
+
+### Use it in your own pipeline
+
+The CLI flags are convenient, but you can call the helpers directly. Build the device mesh, apply TP *before* the model lands on the GPUs, switch to the `native` backend, then enable CP — the rest of your pipeline code is unchanged:
+
+```python
+from torch.distributed.device_mesh import init_device_mesh
+
+# Make the helper module importable.
+import sys
+sys.path.insert(0, "examples/cosmos3")
+from cosmos_parallel import (
+    enable_cosmos3_context_parallel,
+    enable_cosmos3_flash_attention,
+    enable_cosmos3_tensor_parallel,
+)
+
+# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK. Pick tp_degree * cp_degree == world size.
+mesh = init_device_mesh("cuda", (tp_degree, cp_degree), mesh_dim_names=("tp", "cp"))
+
+# Load on CPU first; a TP-sharded model may not fit one GPU.
+pipe = Cosmos3OmniPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+if tp_degree > 1:
+    enable_cosmos3_tensor_parallel(pipe.transformer, mesh["tp"])  # shard weights -> GPUs
+pipe.to(f"cuda:{local_rank}")                                     # move the replicated remainder
+pipe.transformer.set_attention_backend("native")
+if cp_degree > 1:
+    enable_cosmos3_context_parallel(pipe.transformer, mesh["cp"])  # shard the sequence
+elif tp_degree > 1:
+    enable_cosmos3_flash_attention(pipe.transformer)               # GQA-safe dense attention
+
+# `pipe(...)` is called exactly as in the single-GPU workflows above.
+```
+
+For CP only (no weight sharding), use a 1-D mesh: `init_device_mesh("cuda", (world_size,), mesh_dim_names=("cp",))` and just `enable_cosmos3_context_parallel`.
+
+> [!TIP]
+> On some multi-GPU topologies the first NCCL all-to-all can hang. If a CP run stalls at the start of the first denoising step, set `NCCL_P2P_DISABLE=1` in the environment before launching `torchrun`.
+
+CP and TP compose with all the workflows above (text-to-video, image-to-video, text-to-video with sound, and action-conditioned generation) and with both the `Nano` and `Super` checkpoints — only the pipeline construction and the parallelism setup lines change.
+
 ## Metadata templates
 
 `tokenize_prompt` appends short metadata sentences inside the user message so the LLM sees the conditioning the model was trained with. The positive prompt gets sentences like *"The video is 7.9 seconds long and is of 24 FPS."* and *"This video is of 720x1280 resolution."*; the negative prompt gets the inverse (*"… is not …"*).

diff --git a/examples/cosmos3/README.md b/examples/cosmos3/README.md
@@ -5,9 +5,14 @@ The canonical reference for `Cosmos3OmniPipeline` lives in the diffusers docs:
 examples there as the source of truth for application code — they cover text-to-image,
 text-to-video, image-to-video, and text+sound modes.
 
-This directory provides a small CLI wrapper (`inference_cosmos3.py`) that exercises the full
-load → encode → denoise → decode path against either the Hub release or a local checkpoint
-during development.
+This directory provides two files:
+
+- `inference_cosmos3.py` — the runnable CLI (text-to-image/video, image-to-video, sound, action
+  modes). Single-GPU by default; pass `--tp-degree` / `--cp-degree` and launch with `torchrun`
+  to run any modality multi-GPU (see [Multi-GPU inference](#multi-gpu-inference-context-parallelism)
+  below).
+- `cosmos_parallel.py` — the importable multi-GPU helpers (context + tensor parallelism). No
+  `main`; the CLI imports from it. Read it to understand or adapt the sharding.
 
 ## Setup
 
@@ -168,3 +173,57 @@ Pick the tier that matches the native resolution of your conditioning input (`48
 | `--no-duration-template` | off | Skip the duration metadata sentence appended to the prompt and negative prompt. Ignored for `--num-frames 1` and for action modes (which build a structured caption instead). |
 | `--no-resolution-template` | off | Skip the resolution metadata sentence appended to the prompt and negative prompt. Ignored for action modes. |
 | `--output` | `.` | Directory to write `sample.jpg` or `sample.mp4`. |
+
+## Multi-GPU inference (context parallelism)
+
+Cosmos 3 can be sharded across GPUs on two orthogonal axes (implemented in `cosmos_parallel.py`):
+
+- **Context parallelism (CP)** — `enable_cosmos3_context_parallel`. The *sequence* is sharded
+  across GPUs and attention runs with two Ulysses all-to-all collectives per layer, cutting
+  per-step latency for long videos / high resolutions. Weights are replicated, so this is for
+  models that already fit one GPU (`Nano`).
+- **Tensor parallelism (TP)** — `enable_cosmos3_tensor_parallel`. The attention and MLP *weight*
+  matrices are sharded across GPUs (Megatron-style), so a checkpoint that doesn't fit one GPU
+  (`Super`, ~120 GB) loads. The sequence is not sharded.
+- **TP + CP** — both at once on a 2-D `(tp, cp)` mesh: a large model *and* latency.
+
+The model itself carries no parallelism logic — it exposes small no-op shard/gather seams, and
+`cosmos_parallel.py` implements the entire path (collectives, GQA KV-head handling, ragged-length
+padding, the dual-pathway attention, weight sharding) behind those two helpers. It is meant to be
+read end to end and adapted.
+
+The CLI imports these helpers, so you run **any modality** (text-to-image/video, image-to-video,
+sound, action modes) multi-GPU by adding `--tp-degree` / `--cp-degree` and launching with
+[torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html) — `--tp-degree * --cp-degree`
+must equal `--nproc_per_node`:
+
+```bash
+# CP only (Nano): CP degree must divide the 32 query heads.
+torchrun --nproc_per_node 4 examples/cosmos3/inference_cosmos3.py --model nano --cp-degree 4 --prompt "..."
+
+# TP only (Super): TP degree must divide the 64 query heads and 8 KV heads.
+torchrun --nproc_per_node 4 examples/cosmos3/inference_cosmos3.py --model super --tp-degree 4 --prompt "..."
+
+# TP + CP (Super), 4 GPUs as 2 x 2, with sound:
+torchrun --nproc_per_node 4 examples/cosmos3/inference_cosmos3.py \
+    --model super --tp-degree 2 --cp-degree 2 --enable-sound --prompt "A waterfall in a forest."
+```
+
+Notes:
+
+- The helpers use the `native` attention backend (the only one that supports GQA's `enable_gqa`),
+  and expand the KV heads to the query-head count so SDPA picks the flash kernel — passing
+  `enable_gqa=True` forces the math kernel, which materializes the full `[S, S]` scores and OOMs
+  on long sequences.
+- Only Ulysses is supported (not ring attention).
+- The CP/Ulysses degree must divide the query heads (32 for `Nano`, 64 for `Super`). For TP,
+  `tp` must divide the KV heads (8), and `tp * cp` must divide the query heads.
+- TP all-reduces on every block, so it's bandwidth-heavy — use the smallest TP degree that makes
+  the weights fit and put the remaining GPUs into CP.
+- Generation size is set with the usual CLI flags (`--num-frames` / `--height` / `--width`), and
+  multi-GPU runs require a seed for reproducibility across ranks (the CLI sets one if you omit `--seed`).
+- On some multi-GPU topologies the first NCCL all-to-all can hang; if a run stalls at the first
+  denoising step, set `NCCL_P2P_DISABLE=1` before launching.
+
+See the [pipeline docs](../../docs/source/en/api/pipelines/cosmos3.md#context-parallelism) for how
+to enable CP and TP from your own pipeline code.