diff --git a/docs/api-guide/core/generalized_tensor_parallel.md b/docs/api-guide/core/generalized_tensor_parallel.md
new file mode 100644
index 00000000000..5a748d5c36e
--- /dev/null
+++ b/docs/api-guide/core/generalized_tensor_parallel.md
@@ -0,0 +1,419 @@
+# Generalized Tensor Parallelism (GTP)
+
+> ⚠️ **Experimental.** GTP is an experimental feature and its API, configuration, and behavior may change in future versions without notice.
+
+**At a glance.** GTP shards every linear weight 1/N along `out_features` across a dedicated GTP process group. The full weight is rematerialized on the fly via an asynchronous all-gather that overlaps with the previous layer's compute on both the forward and backward passes, and the wgrad is reduce-scattered the same way on the way back. Effective per-GPU weight memory shrinks to `1/N`, and the design composes orthogonally with TP / SP / EP / DDP / CUDA Graphs.
+
+**Scope**: a high-level summary of GTP — design intent, public CLI surface, and Megatron-LM ↔ TransformerEngine integration touchpoints.
+
+Core implementation: `megatron/core/tensor_parallel/generalized_tensor_parallelism.py`. The public surface is re-exported from `megatron/core/tensor_parallel/gtp.py`. Low-precision tensor primitives (FP8 / MXFP8 / NVFP4) remain in TransformerEngine and are imported by `generalized_tensor_parallelism.py`.
+
+**Outline:**
+
+1. [Features](#1-features)
+   - 1.1 [Fine-grained, per-weight materialization & gradient reduction](#11-fine-grained-per-weight-materialization--gradient-reduction)
+   - 1.2 [CUDA graph compatibility](#12-cuda-graph-compatibility)
+   - 1.3 [Low-precision quantize-then-gather](#13-low-precision-quantize-then-gather)
+   - 1.4 [Composability with TP / SP / EP / DDP](#14-composability-with-tp--sp--ep--ddp)
+   - 1.5 [Opt-in, minimally invasive integration](#15-opt-in-minimally-invasive-integration)
+   - 1.6 [Optimizer-agnostic (Adam + Muon)](#16-optimizer-agnostic-adam--muon)
+   - 1.7 [Scaling](#17-scaling)
+   - 1.8 [Native distributed checkpointing (DCP)](#18-native-distributed-checkpointing-dcp)
+2. [Usage](#2-usage)
+   - 2.1 [Required flags](#21-required-flags)
+   - 2.2 [High-priority streams (Blackwell and later)](#22-high-priority-streams-blackwell-and-later)
+   - 2.3 [Minimal end-to-end example](#23-minimal-end-to-end-example)
+   - 2.4 [Tuning knobs](#24-tuning-knobs)
+3. [Implementation details](#3-implementation-details)
+   - 3.1 [GTP architecture (Mcore ↔ TE integration)](#31-gtp-architecture-mcore--te-integration)
+   - 3.2 [DDP buckets with (E)GTP](#32-ddp-buckets-with-egtp)
+   - 3.3 [Distributed checkpointing (DCP)](#33-distributed-checkpointing-dcp)
+4. [Testing](#4-testing)
+
+---
+
+## 1. Features
+
+### 1.1 Fine-grained, per-weight materialization & gradient reduction
+
+Each weight is sharded 1/N across a GTP group along `out_features`, stored as a `GTPShardedParam` subclass of `nn.Parameter`. Materialization and gradient reduction are both **per-weight, per-call** — not per-model or per-module:
+
+- **Independent state per param**: each has its own AG state (`state`) and RS state (`rs_state`) machines, both cycling `NONE → ASYNC_WAIT → DATA_READY → NONE` and tracked separately so fwd and bwd async ops don't interfere.
+- **Prefetch chain for AG** (doubly-linked `prev_w` / `next_w`): during fwd, each weight's `all_gather_and_prefetch` issues async AG for `next_w`; during bwd, `all_gather_and_prefetch_bwd` issues async AG for `prev_w`. Layer *i*'s AG overlaps with layer *i−1*'s GEMM. For an L-layer model, L−1 all-gathers are fully hidden behind compute. When activation recompute is enabled, a **third** chain prefetches the recompute-forward gathers during backward — see §3.1 *Recompute-forward prefetch chain*.
+- **Deferred RS finalize for wgrad**: `wgrad_reduce_scatter` on param *i* launches an **async** reduce-scatter (handle stashed in `_wgrad_rs_handle`) and returns `None` to autograd — the wgrad is NOT finalized into `main_grad` yet. Finalization is **deferred one step**: the next bwd step (param *i−1*'s `wgrad_reduce_scatter`) calls `self.next_w._wait_reduce_scatter()` + `_finalize_wgrad()`, which waits on the stashed handle, accumulates the reduced wgrad into `main_grad`, and fires the DDP `register_grad_ready` hook. The chain's head (first-in-fwd, last-in-bwd) uses a synchronous RS since nothing follows it. This one-step deferral is what lets layer *i*'s RS overlap with layer *i−1*'s bwd GEMMs.
+- **Cold start only**: every weight's very first AG is synchronous (`DATA_READY_SYNC`, no prefetch has run yet); the async prefetch chain kicks in from the second forward onward.
+
+Contrast with FSDP: FSDP gathers at module-group granularity in full precision with PyTorch-managed lifecycle. GTP works at individual-weight granularity, in quantized form, with its own explicit ticket-based buffer pool and a one-step-deferred RS finalizer.
+
+> **FSDP can't shrink into GTP because FSDP's overlap is bucket-grained by design** — bucket granularity exists *to avoid* paying NCCL launch latency on tiny params (LayerNorm γ/β, biases, Mamba `dt_bias`/`D`/`A_log`) and *to avoid* the per-weight scheduling state that GTP relies on (per-param prefetch chain, ticket-based buffer cache, stream choreography). Removing buckets doesn't make FSDP faster; it makes FSDP into GTP, with all the engineering that entails — selective wrapping (only large GEMM weights), per-weight prefetch chain, per-param buffer ticket, and explicit AG/RS stream choreography on a side stream so external drains have something meaningful to wait on.
+
+### 1.2 CUDA graph compatibility
+
+CG compatibility is designed-in from day one, not retrofitted. The entire sync / buffer / chain architecture is shaped around making **captured fwd/bwd replays produce identical bit-for-bit behavior** — without the usual capture-vs-eager pitfalls that force other weight-sharding schemes to either disable CG or require special handling.
+
+- **Two chains, never cross-linked** (`GTPChain.GRAPHED` / `GTPChain.UNGRAPHED`). `prev_w` / `next_w` only connect same-chain params, so a captured traversal never reaches into eager Python and vice-versa.
+- **`torch.cuda.Event(external=True)`** for `ag_event` / `rs_event` — the events survive CG capture boundaries and can be waited on from replay-time streams.
+- **Idempotent ticket cache**: `GTPWeightCache.get(ticket)` keeps `slot.buf` set even after `release()`, so replays read the same buffer address as capture. `clear()` drops buffers while keeping tickets valid → supports CG re-capture with lazy re-allocation.
+- **Allocate-in-pool at creation** (`set_cuda_graph_mempool` + `_graphed_alloc`): GRAPHED-chain AG/RS buffers and quantized weight storage are allocated **directly into the CG memory pool** at first creation (during warmup, before capture), so no CUDA allocations happen inside the captured graph — and no post-hoc reallocation/clone is needed. UNGRAPHED buffers stay in regular allocator memory.
+- **Lazy, one-shot chain linking**: `prefetch_initialized` is flipped during the first fwd (warmup), so the chain-construction Python side-effects never execute inside a captured graph. The link table is buffered and flushed atomically at the second forward.
+- **DDP hook manual triggering**: `register_grad_accum_hook` stores the DDP hook on the param; `_CudagraphReplayNode.backward` calls it manually after replay (since `AccumulateGrad` hooks are silenced by replay). This is also how the `assert self.grad_reduce_handle is not None` failure from partial-CG + overlap-grad-reduce is resolved.
+- **Warmup is side-effect-free on `main_grad`**: GTP accumulates wgrad into `main_grad` *inside* the backward (the fusion path returns wgrads as graph outputs instead). Graph capture only *records* ops; it never runs them. But `create_fwd_graph` runs an **eager** warmup fwd+bwd before capturing. That warmup backward executes GTP's `main_grad.add_`. Its deferred cascade adds into a cross-graph `next_w` (another module) from a **stale RS ticket** — the prior backward's wgrad. And `create_cudagraphs()` runs *after* `finalize_model_grads`. So this overwrites the finalized (reduced + per-token-scaled) grads and spikes the step's grad norm. **Fix**: `create_fwd_graph` snapshots the grads its warmup touches — own params + cross-graph `next_w` — via `_backup_grads_before_capture`, then restores them after capture. The bwd graph has no warmup, so it needs none. Bounded to one module's grads.
+- **Drains at CG / eager boundary**: `_drain_gtp_side_streams()` before eager MoE expert compute. Inside bwd capture, two-phase drain: Phase 1 joins the within-graph cascade and records `bwd_completion_event` (next runner unblocks); Phase 2 calls `wait_async_comms(GRAPHED)` to drain the chain-tail handle and re-joins side streams (queued after the event so it doesn't delay the next runner).
+- **Side-stream registration**: the `(GRAPHED, gtp_group)` ag/rs streams are materialized at runner init (`_register_gtp_side_streams`) so they are captured before the first forward.
+
+### 1.3 Low-precision quantize-then-gather
+
+Wire bandwidth scales with the **quantized** size, not BF16 size — GTP composes with low-precision training rather than fighting it.
+
+- **FP8 / MXFP8**: quantize kernel runs per microbatch on the local shard with no GTP-group amax reduction (FP8 amax allreduce is the standard DP-group one in `reduce_and_update_fp8_tensors`, unchanged by GTP). On subsequent microbatches, `skip_weight_cast=True` reuses the quantized buffer.
+- **NVFP4** (4-bit, block-scaled): amax reduced across the GTP group before scaling so ranks share a consistent scale for the full weight; custom `_all_gather_nvfp4` handles rowwise + columnwise views and interleaved layout. Post-processing (re-assemble interleaved data, re-pad `scale_inv`, transition to `GEMM_READY`) is deferred into `_NVFP4AllGatherAsyncHandle.wait()` so it stays off the critical path.
+- **Coalesced NCCL**: `grouped_gather_along_first_dim` uses `torch.distributed._coalescing_manager` to batch E experts' AGs into a single NCCL op. `BatchedNVFP4AllGatherAsyncHandle` wraps per-expert post-processing.
+- **Padding**: at construction the **full tensor** is padded along dim0 to a multiple of `pad_for_alignment × gtp_size`, then sharded equally across the group. After all-gather, the padding ends up contiguous at the tail, so stripping is a single trailing slice (`tensor[:-pad_length]`) — no per-shard reshuffle, and the design naturally supports `pad_length` large enough to span multiple ranks' shards when the unpadded dim0 is small.
+
+#### Per-microbatch schedule
+
+```
+Steady-state fwd (NVFP4):
+    default: ──GEMM(W_0)──quant+amax(W_1)──GEMM(W_1)──quant+amax(W_2)──GEMM(W_2)──...
+    ag_str:                       [AG_issue W_1]            [AG_issue W_2]
+
+Steady-state fwd (FP8 / MXFP8):
+    default: ──GEMM(W_0)────quant(W_1)─────GEMM(W_1)────quant(W_2)─────GEMM(W_2)──...
+    ag_str:                       [AG_issue W_1]            [AG_issue W_2]
+                              (no GTP-group amax allreduce)
+
+Steady-state bwd (all recipes):
+    default: ──bwd GEMMs(W_i)──...
+    ag_str:               [AG_issue W_{i-1}]
+                          (bwd reuses fwd's quantized buffer; no quant, no amax)
+```
+
+quant+amax run sequentially with surrounding compute on the default stream; only the `dist.all_gather` issue is wrapped in `with torch.cuda.stream(ag_stream)`. The NCCL kernel runs on c10d's private ncclStream and overlaps with the next GEMM until it reaches its wait.
+
+For NVFP4 the per-microbatch prefetch cost is **two** NCCL ops on the GTP ncclStream (amax allreduce + AG) serialized on the same communicator. FP8 and MXFP8 incur only the AG; their standard DP-group amax allreduce in `reduce_and_update_fp8_tensors` is unchanged by GTP. BF16 skips quant entirely.
+
+#### Communication volume breakdown
+
+Per-microbatch per-weight comm budget (assuming bf16 wgrad reduce-scatter):
+
+| Format | Block | Data B/elem | Scale_inv B/elem | Per-elem | Fwd AR(amax)                   | Fwd AG | Bwd AG | Wgrad RS (bf16) | Total B/elem | vs BF16        |
+|--------|-------|-------------|------------------|----------|--------------------------------|--------|--------|-----------------|--------------|----------------|
+| BF16   | n/a   | 2.0000      | —                | 2.0000   | —                              | 2.0000 | 2.0000 | 2.0000          | 6.0000       | 1.00× (baseline) |
+| MXFP8  | 32    | 1.0000      | 1/32 = 0.0313    | 1.0313   | — (microscale, no global amax) | 1.0313 | 1.0313 | 2.0000          | 4.0626       | 0.68× (–32%)   |
+| NVFP4  | 16    | 0.5000      | 1/16 = 0.0625    | 0.5625   | ≈0 in volume (latency-bound)   | 0.5625 | 0.5625 | 2.0000          | 3.1250       | 0.52× (–48%)   |
+
+How to read the columns:
+- `Per-elem` = `Data B/elem + Scale_inv B/elem` — wire cost of one quantized weight buffer (data + scale_inv together).
+- `Fwd AG` and `Bwd AG` each carry the quantized buffer once, so they equal `Per-elem`. Bwd reuses fwd's `self.quantized` buffer — no re-quantize, no AR(amax).
+- `Wgrad RS (bf16)` = 2.0 B/elem — gradient is reduce-scattered in bf16 regardless of weight precision.
+- `Fwd AR(amax)` is a separate NCCL collective: NVFP4 needs it (one fp32 scalar per tensor → ~0 B/elem volume but a fixed launch latency); MXFP8 doesn't (microscale-only).
+- `Total B/elem` = `Fwd AG + Bwd AG + Wgrad RS` — amax AR is omitted because its volume is essentially 0.
+
+Quantize-then-gather attacks AG only: AG portion shrinks ~72% from BF16 → NVFP4, but RS is untouched, so the wgrad RS becomes the dominant comm path in NVFP4 (~64% of the budget at bf16 RS, ~78% at fp32 RS).
+
+### 1.4 Composability with TP / SP / EP / DDP
+
+- **TP** (intra-layer): orthogonal axis — GTP shards `out_features` regardless of TP's parallel mode (column or row). 2D grid naturally formed via `tp_group × gtp_group`.
+- **SP** (sequence-parallel): transparent — GTP operates at weight dim, SP at sequence dim.
+- **EP** (MoE): `GroupedLinear` with GTP → each routed expert sharded across `EXPERT_GTP_WEIGHT_REMAT_GROUP`, independent of EP. MoE AllToAll (HybridEP/NVLink) runs independently of GTP AG/RS (NCCL/IB).
+- **DDP**: GTP bypasses autograd's grad accumulator (async RS returns `None`; `_finalize_wgrad` accumulates directly into `main_grad`). DDP registers its grad-ready hook on GTP params via `register_grad_accum_hook` (not autograd's `AccumulateGrad`); GTP invokes it from `_finalize_wgrad` (eager path) and `_CudagraphReplayNode.backward` (captured path) **after** the wgrad lands in `main_grad`, so a bucket's DDP reduce-scatter runs strictly after every GTP param's `{RS → main_grad add}` — never over a stale `main_grad` — and DDP↔GTP NIC deadlock at IB scale is avoided. See §3.2.
+
+### 1.5 Opt-in, minimally invasive integration
+
+- Drop-in `gtp_group` kwarg on `Linear` / `LayerNormLinear` / `LayerNormMLP` / `GroupedLinear`; no framework-level refactor required.
+- **Per-weight opt-in.** GTP wraps only weights threaded with the `gtp_group=` kwarg — typically the heavy GEMM linears (`Linear` / `LayerNormLinear` / `LayerNormMLP` / `GroupedLinear`). Small replicated tensors (LayerNorm γ/β, biases, Mamba `dt_bias`/`A_log`/`D`/`conv1d`, MoE router, latent-proj MLPs) stay full — no NCCL launch latency for params where the all-gather wouldn't amortize. The split is visible in §3.2's *dense non-GTP* vs *dense GTP* membership.
+- `classify_gtp_chains(model)` walks `named_parameters()` once at init and sets `chain_id` on every `GTPShardedParam` based on the current `cuda_graph_modules`.
+- Turning it off is a no-op: when `gtp_group.size() == 1`, `wrap_module_params_gtp` short-circuits; when `gtp_weight_remat_size == 1`, the GTP path in `layers.py` is skipped entirely.
+- User-tunable knobs (`GTPConfig.pad_for_alignment`, `weight_prefetch`, `check_param_states`) plus a debug-name tagger (`tag_gtp_params_with_names`) for readable link-table output.
+
+### 1.6 Optimizer-agnostic (Adam + Muon)
+
+GTP runs under both the standard **Adam** `DistributedOptimizer` and **Muon** (the `LayerWiseDistributedOptimizer`), DCP save/load included:
+
+- **Adam** shards optimizer state over the gtp/egtp-excluded replicate group, like any GTP run (§3.2).
+- **Muon** keeps matrix params *whole* (Newton–Schulz needs the full 2D weight). A GTP-replicated whole param (e.g. MoE router, latent-proj MLPs) then lands on one checkpoint key shared by all GTP peers, so the LayerWise optimizer folds `gtp_rank` into its `replica_id` — exactly one peer writes (the optimizer-state analog of the model-side fold in §3.3). Mamba `in_proj` (a gathered+split factory on the model side) saves its optimizer state per-shard via a small backfill helper.
+
+Neither path adds a GTP-specific checkpoint format or call site.
+
+### 1.7 Scaling
+
+Effective per-GPU weight size = `W / (TP × GTP)`. Example: TP=4 + GTP=8 with NVFP4 → 32× weight-memory reduction and 128× wire-bandwidth reduction vs full BF16 replication, before data parallelism.
+
+**Weak scaling.** GTP fixes the shard width and grows the job by adding data-parallel replicas (DP = #GPUs / GTP), so per-GPU compute stays constant while only the DP gradient reduction widens with scale.
+
+The best GTP size is model- and cluster-dependent — driven by weight sizes, per-GPU memory headroom, and which collectives can be kept on fast links — so there is no single recommended value. The example below runs on **GB200 NVL72** (a 72-GPU NVLink domain) and uses **GTP64**, which places communication as:
+
+- **NVLink-local:** the *dense-layer* (Mamba / attention / shared-expert) GTP weight all-gather + wgrad reduce-scatter, **and** the `EP64` all-to-all dispatch/combine — all kept inside one ≤72-GPU NVLink domain (EP64 ≤ NVL72).
+- **Inter-node (IB / CX7):** the DP gradient reduction **plus** the `EGTP2` expert-weight all-gather / wgrad reduce-scatter, whose 2 shards land on different NVLink domains and so cross nodes.
+
+On an Ultra-proxy hybrid Mamba-MoE model (**~280B parameters**; `GTP64 · EP64 · EGTP2`, mb1, MXFP8, BF16 reduce-scatter, no CUDA graph), scaling efficiency holds **≥93 % of the single-domain (128-GPU / DP2) baseline out to 3072 GPUs (DP48)**, while max reserved memory *decreases* with scale (137 → 104 GB) as the distributed optimizer shards optimizer/grad state across more DP replicas.
+
+> **Takeaway:** near-flat weak scaling — **≥93 % efficiency from 128 → 3072 GPUs**, with per-GPU memory shrinking as DP grows.
+
+![GTP64 weak-scaling efficiency](../../images/generalized_tensor_parallel/0617_gtp64_weak_scaling_efficiency.png)
+
+### 1.8 Native distributed checkpointing (DCP)
+
+**GTP + DCP is straightforward:**
+- Reuses the existing checkpoint stack rather than adding a parallel one. GTP-sharded weights *and* distributed-optimizer state save/load through the standard PyTorch / Mcore `torch_dist` sharded checkpoint, with **no GTP-specific format or call path** and a tiny code footprint (one new helper + one helper made GTP-aware).
+- Checkpoints **reshard freely** across different `(TP, GTP, EGTP, DP, PP)` topologies — including a different GTP/EGTP size — with no offline conversion.
+
+See [§3.3 Distributed checkpointing (DCP)](#33-distributed-checkpointing-dcp) for details.
+
+---
+
+## 2. Usage
+
+GTP is enabled through two CLI flags on Megatron's training launcher; everything else (process-group construction, parameter slicing, prefetch chain wiring, optimizer routing) is automatic once the flags are set.
+
+### 2.1 Required flags
+
+```bash
+# Total number of shards each dense weight (attention, mamba, MLP linears) is split into along
+# out_features, across the tensor-parallel + GTP axes. Must be >= --tensor-model-parallel-size and
+# divisible by it. The GTP degree is derived as num_weight_shards / tensor_model_parallel_size
+# (e.g. TP=1 + num_weight_shards=2 -> GTP=2; TP=2 + num_weight_shards=8 -> GTP=4).
+--tensor-parallel-num-weight-shards <num_weight_shards>
+
+# Total number of shards each MoE routed-expert weight is split into along out_features, across the
+# expert-tensor-parallel + expert-GTP axes. Must be >= --expert-tensor-parallel-size and divisible
+# by it. The expert-GTP degree is derived as num_weight_shards / expert_tensor_parallel_size.
+# Independent from --tensor-parallel-num-weight-shards; can be left unset for non-MoE models.
+--expert-tensor-parallel-num-weight-shards <num_weight_shards>
+```
+
+> The (dense / expert) GTP degree is exposed **only** through
+> `--tensor-parallel-num-weight-shards` / `--expert-tensor-parallel-num-weight-shards`. The internal
+> `gtp_weight_remat_size` / `expert_gtp_weight_remat_size` config fields are derived from them and
+> have no CLI flag.
+
+### 2.2 High-priority streams (Blackwell and later)
+
+Required on GB200 / GB300 so the GTP comm streams get the SM priority needed for AG/RS overlap with compute:
+
+```bash
+--high-priority-stream-groups ep gtp expt_gtp tp
+```
+
+The launcher also exports `CUDA_GRAPHS_USE_NODE_PRIORITY=1` so captured CUDA graphs respect the inherited stream priority.
+
+### 2.3 Minimal end-to-end example
+
+```bash
+# 4 ranks, TP=2 + GTP=2 across out_features, BF16 weights.
+# TP=2 + num-weight-shards=4 -> GTP = 4 / 2 = 2.
+torchrun --nproc-per-node 4 pretrain_gpt.py \
+    --tensor-model-parallel-size 2 \
+    --pipeline-model-parallel-size 1 \
+    --tensor-parallel-num-weight-shards 4 \
+    --expert-tensor-parallel-num-weight-shards 1 \
+    --high-priority-stream-groups ep gtp expt_gtp \
+    --bf16 \
+    --num-layers 12 --hidden-size 1024 --num-attention-heads 16 \
+    --seq-length 1024 --max-position-embeddings 1024 \
+    --micro-batch-size 1 --global-batch-size 4 \
+    --train-iters 10 \
+    --use-mcore-models \
+    --transformer-impl transformer_engine \
+    --tokenizer-type NullTokenizer --vocab-size 32000 \
+    --data-path <data> --split 99,1,0
+```
+
+At iter-0 you'll see one rank-0 log line confirming the active config:
+
+```
+GTP enabled. GTPConfig(pad_for_alignment=16, check_param_states=False,
+  weight_prefetch=True, async_reduction=True, fp8_param_gather=False)
+```
+
+### 2.4 Tuning knobs
+
+Set via `from megatron.core.tensor_parallel.gtp import GTP_CONFIG, update_gtp_config`:
+
+```python
+update_gtp_config(
+    pad_for_alignment=16,         # NVFP4: 16, MXFP8: 32, BF16: any; auto-set in training.py
+    weight_prefetch=True,         # Disable to debug the cold-start path
+    async_reduction=True,         # Whether to perform GTP gradient reduction asynchronously
+    fp8_param_gather=False,       # Companion to Megatron's --fp8-param-gather; currently asserted off
+)
+```
+
+`training.py` auto-tunes `pad_for_alignment` based on the quantization recipe (`--fp4`, `--fp8-recipe=mxfp8`, etc.) before model construction. The other knobs are usually left at defaults.
+
+> **CUDA-graph warmup under GTP.** When CUDA graphs are enabled, GTP forces a minimum of **2** per-graph warmup steps regardless of `--cuda-graph-warmup-steps` (e.g. a user-set `0` is bumped to `2`): the first warmup builds the weight-prefetch chain and the second exercises the prefetch path before capture.
+
+---
+
+## 3. Implementation details
+
+### 3.1 GTP architecture (Mcore ↔ TE integration)
+
+![GTP / Mcore-TE integration architecture](../../images/generalized_tensor_parallel/0525_gtp_mcore_te_architecture.png)
+
+TransformerEngine owns the linear primitives (`Linear` / `LayerNormLinear` / `LayerNormMLP` / `GroupedLinear`) and the low-precision tensor types (FP8 / MXFP8 / NVFP4). Megatron-LM owns the GTP scheduling state — the prefetch chain, the ticket-based buffer cache, the per-param AG/RS state machines, the GRAPHED/UNGRAPHED chain split, and the DDP integration. The two are bridged by:
+
+1. The `gtp_group` kwarg that Mcore's `extensions/transformer_engine.py` threads into the TE constructors when `is_te_min_version("2.17.0")`.
+2. The hook registry (`register_gtp_hooks`), called by TE's `module/base.py` at `reset_parameters` time to slice each weight into a `GTPShardedParam` along `out_features`.
+3. The `_register_gtp_side_streams` / drain calls that synchronize TE's quantize + GEMM kernels with the side stream that owns the AG/RS NCCL ops.
+
+#### What the flags do under the hood
+
+1. `parallel_state.initialize_model_parallel(...)` treats GTP/EGTP as **first-class orthogonal axes** (`world_size = TP*GTP*CP*DP`, and the expert grid `= ETP*EP*EGTP*PP*expert_dp`). It builds the shard groups `_GTP_WEIGHT_REMAT_GROUP` (size = `--tensor-parallel-num-weight-shards / --tensor-model-parallel-size`) and `_EXPERT_GTP_WEIGHT_REMAT_GROUP` (size = `--expert-tensor-parallel-num-weight-shards / --expert-tensor-parallel-size`), plus the gtp/egtp-EXCLUDED replicate DP groups (`_DATA_PARALLEL_GROUP_NO_GTP`, `_EXPERT_DATA_PARALLEL_GROUP_NO_GTP`) that DDP and the optimizer shard over. These `*_no_gtp` groups alias the regular DP groups when GTP is inactive (remat size 1).
+2. Megatron's `extensions/transformer_engine.py` reads `pg_collection.gtp` / `pg_collection.expt_gtp` and forwards them as the `gtp_group=` kwarg to `te.Linear` / `te.LayerNormLinear` / `te.GroupedLinear`. TE's `module/base.py` calls back into `megatron.core.tensor_parallel.gtp` via the hook registry (`register_gtp_hooks`) to slice each weight at `reset_parameters` time.
+3. DDP treats GTP shards as ordinary params: they go into the same dense / expert buffers as everything else, reduced over the gtp/egtp-EXCLUDED replicate group (`intra_dp_cp_no_gtp_group` / `intra_expt_dp_no_egtp_group`) with the standard `1/full` scaling. The gtp axis is completed elsewhere — GTP shards by their reduce-scatter sum, replicated (non-GTP) params by a SUM all-reduce in `finalize_model_grads`. See §3.2.
+4. Optimizer state is sharded over the same replicate group; clip-by-global-norm reduces squared norms over the dist-opt grad-stats group, which spans the full world (including the gtp/egtp axis), with replicated non-GTP params counted once per gtp/egtp axis to avoid over-counting.
+5. `classify_gtp_chains(model)` runs once after model build (in `training.py`'s `get_model`) and wires each `GTPShardedParam` into a `GRAPHED` or `UNGRAPHED` prefetch chain based on the active `cuda_graph_modules`.
+
+#### Buffer / memory management
+
+Two distinct pools with explicit lifecycle rules:
+
+- **`GTPWeightCache`** (AG/RS output buffers) — ticket-based, keyed on `(shape, dtype, fwd, expert_idx, reduce_scatter)`. Same-shape buffers across layers are shared. Tickets persistent; buffer allocated lazily on first `get()`; addresses stable across iterations for CG replay.
+- **`_wgrad_buf_pool`** (UNGRAPHED wgrad input recycling) — tagged with `_from_gtp_wgrad_pool=True` at `_wgrad_pool_get`. `_wgrad_pool_put` no-ops on foreign buffers (fresh allocs from Megatron `layers.py` or aten F.embedding bwd) → caching allocator handles those. Prevents the pool from accumulating untagged buffers each iter.
+
+#### Overlap design summary
+
+```
+fwd:  AG(W_{i+1}) ∥ GEMM(W_i)                              ∥ CG replay of captured layers
+bwd:  AG(W_{i-1}) ∥ dgrad(W_i) → wgrad(W_i) ∥ RS(wgrad_i)  ∥ [finalize wgrad_{i+1} + DDP hook]
+```
+
+GTP runs up to **three** independent prefetch chains, all following one rule — *prefetch the weight the next consume will need*:
+
+| # | when | consume | prefetch (overlap) | AG direction | slot |
+|---|------|---------|--------------------|--------------|------|
+| 1 | fwd | weight `i` | `next_w` = i+1 ‖ `GEMM_i` | rowwise (`fwd=True`) | `_prefetch_handle` |
+| 2 | bwd dgrad | weight `i` | `prev_w` = i−1 ‖ `Dgrad_i` | columnwise (`fwd=False`) | `_prefetch_handle` |
+| 3 | bwd recompute | weight `i` | `_recompute_next` = i+1 ‖ `recompute_GEMM_i` | rowwise (`fwd=True`) | `_recompute_prefetch_handle` (separate) |
+
+Chain 3 exists only when activation recompute is on. It mirrors chain 1 (rowwise, prefetch `next`) but runs *during* backward, so it overlaps chain 2 in time on the same weight — hence its **own** slot. fwd (1) and bwd-dgrad (2) never overlap in time, so they safely share `_prefetch_handle`. See *Recompute-forward prefetch chain* below.
+
+At bwd step *i* the step is launching *RS of wgrad_i* while finalizing the *previous* iter's wgrad (`wgrad_{i+1}` in bwd order = the next-one-over in fwd order). That one-step deferral is what makes the RS run concurrent with the next layer's dgrad/wgrad GEMMs instead of blocking after every layer.
+
+Communication never blocks compute except at the very first layer of each direction (cold start) and at enforced serialization points (CG/eager drains, finalize-grads barrier).
+
+##### wgrad-before-dgrad schedule  *(deferred to a follow-up MR)*
+
+Current behavior: backward always runs dgrad GEMM, then wgrad GEMM, then issues the GTP wgrad RS — the RS overlaps with the *next* layer's bwd GEMMs (the one-step deferral above).
+
+A future MR will add an opt-in wgrad-before-dgrad schedule on `_Linear` / `_LayerNormLinear` so the GTP wgrad RS NCCL overlaps with the dgrad GEMM of the **same** layer (best for the GTP + no-TP case).
+
+##### Recompute-forward prefetch chain  *(GTP + activation recompute)*
+
+When a GTP-sharded module is in `--recompute-modules` (e.g. `shared_experts`), its forward is **re-run during backward** to regenerate activations. That recompute-forward must all-gather each weight **rowwise** again — a *third* gather lifecycle, concurrent with the in-flight **columnwise** dgrad gather of the *same* weight. Since both share one `GTPShardedParam`, the recompute path gets its **own** prefetch slot (`_recompute_prefetch_handle` / `_recompute_ag_event`, reusing the `_ag_ticket_fwd` rowwise buffer) so it never clobbers the dgrad lifecycle's `state` / `_prefetch_handle` / `ag_event`.
+
+The recompute weights form a **separate** linked list (`_recompute_next`), **self-populated** on the first backward from the weights actually re-gathered while `in_fp8_activation_recompute_phase()` is true — membership is *observed*, not configured (no tagging, so it tracks exactly what each checkpointed module re-gathers). Each recompute-forward consume prefetches the next recompute weight, so every gather **except the global-first** overlaps preceding recompute / dgrad / wgrad compute:
+
+```
+recompute-fwd of shared_experts  (per layer: GEMM fc1 → SReLU → GEMM fc2, then dgrad+wgrad)
+
+  Before (on-demand):
+    default: AG(fc1)─GEMM fc1─SReLU─AG(fc2)─GEMM fc2─dgrad─wgrad─...   every AG exposed
+  After (recompute chain):
+    default:         GEMM fc1─SReLU─GEMM fc2─dgrad─wgrad─GEMM fc1'─... back-to-back
+    ag_str:  AG(fc1)        [AG fc2]        [AG fc1' (next layer)]     only AG(fc1) exposed
+```
+
+`AG(fc2)` is issued at `fc1`'s consume (overlaps GEMM fc1 + SReLU); `AG(fc1')` for the next layer is issued at `fc2`'s consume, so it overlaps the **whole** layer's `dgrad + wgrad` window. The cross-layer link is what hides every region head except the very first.
+
+Under **full-iteration CUDA graphs** the recompute-forward is captured; `wait_async_comms(GRAPHED)` drains the recompute handle too (sets `_recompute_already_drained`) so the captured consumer skips its cross-graph wait — the same producer-drain pattern as the fwd/bwd chains.
+
+> **When *not* to recompute a GTP weight.** Recompute on a GTP-sharded weight adds this extra rowwise gather. For MLP-like blocks at short context (`SeqLen ≤ 2 × HiddenSize`), GTP-sharding the weight saves *more* memory than recomputing its activations, so the better trade is to keep such modules GTP-sharded and **out** of `--recompute-modules` (offload their activations if needed) — avoiding the third gather entirely. Build the recompute chain only for modules that genuinely need both.
+
+### 3.2 DDP buckets with (E)GTP
+
+![DDP + (E)GTP interaction with the distributed optimizer](../../images/generalized_tensor_parallel/0611_ddp_egtp_orthogonal_bucketing.png)
+
+**(E)GTP is *super loosely coupled* to DDP and the distributed optimizer — they stay completely GTP-agnostic.** GTP is just another sub-axis of the rank grid (`world = TP×GTP×CP×DP`); a GTP-sharded weight rides the *exact same* code path as an ordinary param. There are **no** GTP/EGTP-specific buffers, optimizers, gradient-scaling factors, or bucket groups. The entire DDP/DistOpt stack touches GTP in only **three** narrow places:
+
+1. **finalize SUM all-reduce** (`_allreduce_replicated_grads_over_gtp_group`) — completes the gtp axis for *replicated* (non-GTP) params; a no-op when GTP is inactive.
+2. **`is_gtp` / `allreduce` tags** propagated onto the optimizer's master shards — consumed only by the grad-norm dedup filter.
+3. **grad-ready hook routing** (`DistributedDataParallel.__init__`) — for a GTP param, DDP registers its backward post-hook via GTP's `register_grad_accum_hook` instead of autograd's `AccumulateGrad`. GTP fires it from `_handle_megatron_grad_accum` **after** the per-param `{wgrad RS → main_grad add}`. This enforces the invariant below; a no-op (plain autograd path) when GTP is inactive.
+
+> **Ordering invariant.** A bucket's DDP gradient reduction (the reduce-scatter / all-to-all + local fp32 accumulation) runs **strictly after every GTP param in that bucket has finished `{GTP wgrad RS → main_grad add}`**. `register_grad_ready` only fires the bucket collective once *all* its params are ready, and for GTP params "ready" is signalled by GTP after the add — never by autograd's `AccumulateGrad`, which (because the wgrad RS is async and its `main_grad` accumulation is deferred to a later backward node) can fire **before** the add and would make the bucket reduce read a stale/empty `main_grad` (notably under `reduce_scatter_with_fp32_accumulation`).
+
+Everything else — bucketing, the reduce-scatter/all-reduce schedule and its overlap, master-state sharding, grad clipping, the checkpoint format — is unchanged and unaware of GTP.
+
+**Why this matters:**
+
+- **Free reuse of a mature stack.** GTP inherits DDP's bucketing + comm/compute overlap, the distributed optimizer's fp32-master + Adam-moment sharding, grad-norm/clip, and the existing checkpoint format — no parallel re-implementation to write or maintain (contrast FSDP, which replaces all of these).
+- **Orthogonal composability.** Because GTP is a rank-grid sub-axis cut like TP (along `out_features`), it composes with TP/EP/CP/PP and the DistOpt the same way TP does — no special nesting logic.
+- **Zero-cost when off.** With GTP disabled the `*_no_gtp` groups alias the regular DP groups and both hooks become no-ops, so non-GTP runs hit byte-identical behavior — GTP can be toggled without forking the DDP/optimizer code paths.
+- **Small, auditable surface.** These three hooks are the whole integration contract, which is what makes the correctness argument below tractable.
+
+DDP groups parameters into **two buffers** by `is_expert_parallel` (MoE tag) — a dense buffer and an expert buffer. GTP/EGTP shards are **merged into** these buffers like ordinary params (no separate GTP/EGTP buckets): they reduce over the gtp/egtp-EXCLUDED replicate group (`intra_dp_cp_no_gtp_group` for dense, `intra_expt_dp_no_egtp_group` for expert) with the standard `1/full = 1/(replicate*gtp)` scaling.
+
+Why this is correct — the gtp axis is completed in two complementary ways, so it is summed exactly once:
+
+- **GTP-sharded weights**: each rank already holds the gtp-summed shard via the (E)GTP wgrad reduce-scatter, then DDP sums over the replicate group → `sum-over-(gtp×replicate) / full = mean`.
+- **Replicated (non-GTP) params** (LayerNorm γ/β, biases, router, …): DDP sums only over the replicate group, leaving them `1/gtp` short; `finalize_model_grads._allreduce_replicated_grads_over_gtp_group` then does a SUM all-reduce over the gtp (dense) / egtp (expert) group to recover the full mean. SUM (not AVG) because the `1/full` DDP scaling already applied.
+
+> **`average_in_collective` must be off (the default).** The `1/(replicate×gtp)` scaling above is a *pre-scale* applied before a SUM collective. `average_in_collective=True` instead uses NCCL AVG, which divides by the collective's own group — the gtp/egtp-**excluded** replicate group — so it divides by `replicate` only, missing the `1/gtp` factor and over-scaling gradients by `gtp`. Asserted via `ProcessGroupCollection.is_gtp_active` in both `arguments.py` (training) and `DistributedDataParallel.__init__` (direct megatron-core users).
+
+**`broadcast_params`** (the one-shot init/load param sync) selects the group by `is_gtp`: GTP shards broadcast over the gtp-excluded `*_no_gtp` group (`dp_cp_no_gtp_group` / `expt_dp_no_egtp_group`), everything else over the regular DP group (`dp_cp_group` / `expt_dp_group`). Excluding (E)GTP peers is essential — each peer holds a distinct 1/N shard of the same `GTPShardedParam`, so a shared group would let rank-0's shard clobber the others. The non-`intra_` ("full") groups are used here so the sync reaches every distopt instance.
+
+**Buffer caching.** The per-buffer lists are concatenated once at init into a single flat view for fast iteration in the grad-reduction hot path.
+
+> **Single distopt instance with GTP.** GTP currently requires `num_distributed_optimizer_instances == 1` (asserted in `parallel_state.py`): partial-distopt sharding of the data domain would need gtp-aware sizing. The dist-opt grad-stats group is therefore the full world.
+
+### 3.3 Distributed checkpointing (DCP)
+
+![GTP + DCP save/load reshard for a TP2×GTP2 weight](../../images/generalized_tensor_parallel/0612_gtp_dcp_tp2gtp2_save_load.png)
+
+GTP supports **PyTorch / Mcore sharded distributed checkpointing** (`--ckpt-format torch_dist`, the `megatron.core.dist_checkpointing` `ShardedTensor` / `ShardedObject` format) for **both model weights and distributed-optimizer state**. Checkpoints are **fully resharding-capable**: a checkpoint saved at one `(TP, GTP, EGTP, DP, PP)` topology can be loaded at a *different* one — including a different GTP/EGTP size — without an offline conversion step.
+
+Consistent with §3.2, GTP stays *loosely coupled* to the checkpoint stack: there is **no GTP-specific checkpoint format or call path**. The shared `make_sharded_tensors_for_checkpoint` helper became GTP-aware and **delegates internally** to a GTP variant only when the `state_dict` actually contains a `GTPShardedParam` (a no-op otherwise), so call sites are unchanged and non-GTP runs are byte-identical.
+
+**Save-side call workflow.** The diagram below traces the save path — from `model.sharded_state_dict()` through the `make_*` helpers down to the terminal `ShardedTensor` / `ShardedObject` sinks. The GTP footprint is deliberately tiny: exactly **one new function** (`make_sharded_tensors_for_checkpoint_with_gtp`, in `gtp.py`, which sets `replica_id` for the GTP-*duplicated* entries) plus **one modified function** (the per-tensor `make_tp_sharded_tensor_for_checkpoint` in `core/utils.py`, made GTP-aware in place to emit the GTP-*sharded* offsets). Every other helper is untouched.
+
+![GTP + DCP checkpoint-save call workflow](../../images/generalized_tensor_parallel/0613_gtp_dcp_save_call_workflow.png)
+
+**How a GTP weight is described to DCP.** GTP always shards `out_features` (axis 0). The helper layers that GTP split onto the existing TP offsets in the `ShardedTensor`, so the global tensor DCP sees is the *full, unsharded* weight:
+
+| Weight kind | TP axis | Emitted axis-0 offset | Other axis |
+|-------------|---------|------------------------|------------|
+| Column-parallel (qkv, fc1) | 0 (same as GTP) | composite `(tp_rank·gtp + gtp_rank, tp·gtp)` | — |
+| Row-parallel (proj, fc2) | 1 | GTP-only `(gtp_rank, gtp)` | TP offset on axis 1 |
+| No TP (GTP-only) | – | `(gtp_rank, gtp)` | — |
+
+Because the offsets reconstruct the global shape, the checkpoint is independent of the save-time grid. On load, DCP reads each rank's `[offset : offset+local]` slice from that global and re-tiles it onto the new grid — e.g. `TP1×GTP2`, `TP2×GTP4`, or a DP change.
+
+**replica_id.** GTP peers hold *distinct* shards (not replicas), so they're disambiguated by their offsets, and `replica_id` ranges over the GTP-*included* DP group. **Replicated** tensors that live alongside GTP weights (LayerNorm γ/β, biases, `_extra_state` objects) would otherwise collide across GTP peers, so the helper folds `gtp_rank` into their `replica_id` — exactly one peer is then elected DCP writer per key.
+
+**`_extra_state`.** This is TransformerEngine's per-module **FP8 calibration state** — for delayed-scaling recipes it holds the `recipe`, the forward/backward `scale` tensors and `amax_history` buffers, plus picklable `extra_fp8_variables`; for BF16 (non-FP8) runs it is an empty tensor. Because it is a pickled byte blob rather than a tensor with a meaningful shape, it is emitted as a `ShardedObject` (via `make_sharded_object_for_checkpoint`), not a `ShardedTensor`. Its amax/scale statistics are *per-tensor globals* for the **full** weight (amax is reduced across the FP8 group), so every GTP peer carries an identical copy — which is exactly why it takes the replicated path above, with `gtp_rank` folded into its `replica_id`.
+
+**Alignment padding & cross-topology reshard.** When `_gtp_slice_one_param` pads `out_features` to a multiple of `gtp_size · pad_for_alignment`, the saved global describes the *padded* shape, so the helper sets `allow_shape_mismatch=True`. DCP then tolerates a load-side topology whose alignment yields a different padded size — the unpadded data overlaps and the tail pad rows are zeros GTP recomputes.
+
+>> Note: Mamba's `in_proj` is a special case: it **all-gathers its GTP shards** back to the logical TP-local size and strips the pad *before* saving, so its global is topology-independent and needs no `allow_shape_mismatch`.
+
+**Optimizer state.** The distributed optimizer's master/moment `ShardedObject`s are keyed by `dp_group_idx`. Under GTP/EGTP each peer owns a *different* master shard (the optimizer shards over the gtp/egtp-**excluded** replicate group), so the index is taken from the gtp/egtp-**merged** model-parallel group (`mp_group` for dense, `expt_tp_pp_with_egtp_group` for expert) — giving every peer a distinct key while replicate-group ranks remain true replicas under that key.
+
+**Post-load cache invalidation.** DCP loads weights with in-place writes to `.data`, which leaves the per-shard low-precision cache (`self.quantized`) stale. `reset_gtp_quantize_cache(model)` is called after load (and RL checkpoint reload) so the first forward after resume re-quantizes from the freshly loaded BF16 weight instead of reusing the pre-load cast.
+
+## 4. Testing
+
+**Whenever you add or change a GTP/EGTP feature, run the GTP unit-test suite below as a sanity check before opening a PR.** These tests exercise the full TE↔Mcore path (weight gather/RS, DDP, distributed optimizer, finalize, grad-norm) and catch silent-correctness regressions that don't surface as crashes.
+
+```bash
+# 4 GPUs; uses the custom TransformerEngine and force-enables GTP.
+export MEGATRON_GTP_FORCE_ENABLE=1
+export TE_PATH=/path/to/TransformerEngine        # the GTP-enabled TE build
+export PYTHONPATH="${TE_PATH}:${PYTHONPATH}"
+torchrun --nproc-per-node 4 -m pytest tests/unit_tests/generalized_tensor_parallel/ -v
+```
+
+| Test file | What it guards |
+|-----------|----------------|
+| `test_gtp.py` | Core GTP shard/gather + DDP bucket alignment. |
+| `test_attention_gtp.py` | GTP on attention linears, loss parity vs no-GTP. |
+| `test_mamba_gtp.py` | GTP on Mamba projection weights. |
+| `test_tp_gtp.py` | GTP composed with tensor parallelism (`tp_group × gtp_group`). |
+| `test_moe_egtp.py` | EGTP on MoE routed-expert weights. |
+| `test_gtp_loss_correctness.py` | End-to-end: GTP per-step loss trajectory matches a no-GTP baseline. |
+| `test_gtp_grad_correctness.py` | Gradient + dist-opt + grad-norm numeric parity vs a DP baseline at replicate (DP) > 1. |
+| `test_gtp_cudagraph_grad.py` | Capture-step grad-norm guard (§1.2): `_backup_grads_before_capture`/`_restore_grads_after_capture` keep a graph capture from clobbering finalized `main_grad` (own params + cross-graph `next_w`, incl. routed-expert `weight_list`). |
+| `test_gtp_dcp.py` | Distributed-checkpoint sharding (§3.3): TP×GTP composite/cross-axis offsets, alignment-pad `allow_shape_mismatch`, cross-topology reshard metadata, and quantize-cache reset. |
+| `test_gtp_muon_dcp.py` | GTP + Muon (LayerWise) optimizer-state checkpoint roundtrip (§1.6): `replica_id` fold for GTP-replicated whole params (router, latent-proj). |
+
+All tests require ≥ 4 GPUs and the GTP-enabled TransformerEngine; they self-skip when those are unavailable. A green run (skips for unmet hardware/config are acceptable) is the minimum bar for any GTP change.
diff --git a/docs/api-guide/core/index.md b/docs/api-guide/core/index.md
index 0d39e46e744..af22af6c6e0 100644
--- a/docs/api-guide/core/index.md
+++ b/docs/api-guide/core/index.md
@@ -16,6 +16,7 @@ Low-level API reference for core Megatron components.
 
 transformer
 tensor_parallel
+generalized_tensor_parallel
 pipeline_parallel
 fusions
 distributed
diff --git a/docs/images/generalized_tensor_parallel/0525_gtp_mcore_te_architecture.png b/docs/images/generalized_tensor_parallel/0525_gtp_mcore_te_architecture.png
new file mode 100644
index 00000000000..672f44c9c14
Binary files /dev/null and b/docs/images/generalized_tensor_parallel/0525_gtp_mcore_te_architecture.png differ
diff --git a/docs/images/generalized_tensor_parallel/0611_ddp_egtp_orthogonal_bucketing.png b/docs/images/generalized_tensor_parallel/0611_ddp_egtp_orthogonal_bucketing.png
new file mode 100644
index 00000000000..2d311138e8d
Binary files /dev/null and b/docs/images/generalized_tensor_parallel/0611_ddp_egtp_orthogonal_bucketing.png differ
diff --git a/docs/images/generalized_tensor_parallel/0612_gtp_dcp_tp2gtp2_save_load.png b/docs/images/generalized_tensor_parallel/0612_gtp_dcp_tp2gtp2_save_load.png
new file mode 100644
index 00000000000..937846e9f0b
Binary files /dev/null and b/docs/images/generalized_tensor_parallel/0612_gtp_dcp_tp2gtp2_save_load.png differ
diff --git a/docs/images/generalized_tensor_parallel/0613_gtp_dcp_save_call_workflow.png b/docs/images/generalized_tensor_parallel/0613_gtp_dcp_save_call_workflow.png
new file mode 100644
index 00000000000..b69bd835769
Binary files /dev/null and b/docs/images/generalized_tensor_parallel/0613_gtp_dcp_save_call_workflow.png differ
diff --git a/docs/images/generalized_tensor_parallel/0617_gtp64_weak_scaling_efficiency.png b/docs/images/generalized_tensor_parallel/0617_gtp64_weak_scaling_efficiency.png
new file mode 100644
index 00000000000..03fc587f96a
Binary files /dev/null and b/docs/images/generalized_tensor_parallel/0617_gtp64_weak_scaling_efficiency.png differ
diff --git a/megatron/core/distributed/distributed_data_parallel.py b/megatron/core/distributed/distributed_data_parallel.py
index e313113a448..0f9c3e424cf 100644
--- a/megatron/core/distributed/distributed_data_parallel.py
+++ b/megatron/core/distributed/distributed_data_parallel.py
@@ -81,9 +81,36 @@ def __init__(
         # Assign all required process groups
         self.dp_group = process_group_dict['dp_group']
         self.dp_cp_group = process_group_dict['dp_cp_group']
-        self.intra_dp_cp_group = process_group_dict['intra_dp_cp_group']
         self.expt_dp_group = process_group_dict['expt_dp_group']
-        self.intra_expt_dp_group = process_group_dict['intra_expt_dp_group']
+        # Example process-group sizes (e.g., TP=2, GTP=64, world_size=1024 with PP=CP=EP=1 and
+        # single DistOpt instance):
+        #   model_size = TP x PP x CP x GTP = 2 x 64 = 128  ->  DP = 1024 / 128 = 8.
+        #   The model weights are replicated DP (= 8) times.
+        #     dp_cp_group (degree of batch sharding, includes GTP) = GTP x DP = 64 * 8 = 512.
+        #     dp_cp_no_gtp_group (degree of weight replication, excludes GTP) = 8.
+        #     gtp_group = 64.
+        #     tp_group = 2.
+        #
+        # Data-parallel gradient reductions for each bucket are performed over dp_cp_no_gtp_group
+        # (GTP-excluded group). Data-parallel gradient reductions over the GTP group are completed
+        # separately in the model backward pass.
+        #
+        # See Section 3.2 in `docs/api-guide/core/generalized_tensor_parallel.md`
+        # for more details (including why average_in_collective=False).
+        #
+        # When GTP is disabled, the *_no_gtp groups alias the regular DP groups.
+        self.intra_dp_cp_group = process_group_dict.get(
+            'intra_dp_cp_no_gtp_group', process_group_dict['intra_dp_cp_group']
+        )
+        self.intra_expt_dp_group = process_group_dict.get(
+            'intra_expt_dp_no_egtp_group', process_group_dict['intra_expt_dp_group']
+        )
+        # Full cross-instance, GTP-peer-EXCLUDED groups for broadcast_params (init-time weight
+        # sync must reach all true replicas). Fall back to the full DP groups when GTP is off.
+        self.dp_cp_no_gtp_group = process_group_dict.get('dp_cp_no_gtp_group', self.dp_cp_group)
+        self.expt_dp_no_egtp_group = process_group_dict.get(
+            'expt_dp_no_egtp_group', self.expt_dp_group
+        )
         self.tp_group = process_group_dict['tp_group']
         self.pp_group = process_group_dict['pp_group']
         self.ep_group = process_group_dict['ep_group']
@@ -166,6 +193,15 @@ def __init__(
 
         self.full_param_layout = full_param_layout
 
+        # GTP needs average_in_collective=False: the per-bucket collective runs over the
+        # GTP-EXCLUDED group, so NCCL AVG would miss the 1/gtp factor. arguments.py guards the
+        # training path; this assert covers direct megatron-core users.
+        gtp_active = ProcessGroupCollection.is_gtp_active(process_group_dict)
+        assert not (gtp_active and self.ddp_config.average_in_collective), (
+            "GTP requires average_in_collective=False (the default); averaged collectives reduce "
+            "over the GTP-excluded group and would miss the 1/gtp gradient scaling factor."
+        )
+
         # Compute gradient scaling factors.
         if config.calculate_per_token_loss:
             assert (
@@ -364,6 +400,15 @@ def unmap_weight_tensor(m):
                                         self._make_backward_post_hook(param)
                                     )
                                     break
+                elif getattr(param, 'is_gtp', False) and hasattr(param, 'register_grad_accum_hook'):
+                    # GTP: drive the post-hook from GTP's manual invocation, not autograd's
+                    # AccumulateGrad. GTP issues the wgrad RS async and defers the main_grad add
+                    # to a later backward node, so AccumulateGrad can fire register_grad_ready
+                    # before the wgrad lands in main_grad, dispatching the bucket reduce-scatter on
+                    # stale grad_data (corrupts reduce_scatter_with_fp32_accumulation for
+                    # chain-boundary weights). GTP fires this hook from _handle_megatron_grad_accum
+                    # after the add instead.
+                    param.register_grad_accum_hook(None, self._make_backward_post_hook(param))
                 else:
                     # Expand so we get access to grad_fn.
                     param_tmp = param.expand_as(param)
@@ -460,9 +505,12 @@ def hook(*unused):
             if param in self.param_to_bucket_group:
                 assert param.requires_grad
                 if self.ddp_config.overlap_grad_reduce:
-                    assert (
-                        param.grad is not None
-                    ), 'param.grad being None is not safe when overlap_grad_reduce is True'
+                    # GTP params legitimately have grad=None (async RS writes wgrad straight
+                    # into main_grad), so skip the assertion for them.
+                    if not getattr(param, 'is_gtp', False):
+                        assert (
+                            param.grad is not None
+                        ), 'param.grad being None is not safe when overlap_grad_reduce is True'
                 if param.grad is not None and (
                     not param.grad_added_to_main_grad or getattr(param, 'zero_out_wgrad', False)
                 ):
@@ -585,11 +633,14 @@ def broadcast_params(self):
         """
         for param in self.module.parameters():
             is_expert_parallel = not getattr(param, 'allreduce', True)
+            is_gtp = getattr(param, 'is_gtp', False)
 
+            # Each (E)GTP peer holds a distinct 1/N shard, so broadcast over the (E)GTP-EXCLUDED
+            # group — else rank-0's shard would clobber the others.
             if is_expert_parallel:
-                data_parallel_group = self.expt_dp_group
+                data_parallel_group = self.expt_dp_no_egtp_group if is_gtp else self.expt_dp_group
             else:
-                data_parallel_group = self.dp_cp_group
+                data_parallel_group = self.dp_cp_no_gtp_group if is_gtp else self.dp_cp_group
             torch.distributed.broadcast(
                 param.data,
                 src=torch.distributed.get_global_rank(data_parallel_group, 0),
diff --git a/megatron/core/distributed/finalize_model_grads.py b/megatron/core/distributed/finalize_model_grads.py
index 51660d12c7d..053c3902b15 100644
--- a/megatron/core/distributed/finalize_model_grads.py
+++ b/megatron/core/distributed/finalize_model_grads.py
@@ -446,6 +446,49 @@ def _allreduce_non_tensor_model_parallel_grads(
 _allreduce_layernorm_grads = _allreduce_non_tensor_model_parallel_grads
 
 
+def _allreduce_replicated_grads_over_gtp_group(model: List[torch.nn.Module]):
+    """Sum wgrads for replicated parameters over the gtp / egtp group.
+
+    The data-parallel collective already reduced wgrads over the GTP-excluded process groups with
+    1/full scaling, so the gtp-axis terms are still missing. A plain SUM (not AVG) over the gtp/egtp
+    group adds them and yields the exact full mean. No-op when GTP is inactive (group size <= 1).
+    """
+    gtp_group = parallel_state.get_gtp_weight_remat_group(check_initialized=False)
+    egtp_group = parallel_state.get_expert_gtp_weight_remat_group(check_initialized=False)
+
+    dense_params, dense_grads = [], []
+    expert_params, expert_grads = [], []
+    for model_chunk in model:
+        for name, param in get_attr_wrapped_model(model_chunk, 'named_parameters')():
+            if not param.requires_grad or getattr(param, 'is_gtp', False):
+                continue  # GTP-sharded params: their gtp axis is handled by the RS-mean.
+            grad_attr = _get_main_grad_attr(param)
+            grad = getattr(param, grad_attr, None)
+            if grad is None:
+                continue
+            grad = _unshard_if_dtensor(grad)
+            if getattr(param, 'allreduce', True):
+                dense_params.append(param)
+                dense_grads.append(grad.data)
+            else:
+                expert_params.append(param)
+                expert_grads.append(grad.data)
+
+    for params, grads, group in (
+        (dense_params, dense_grads, gtp_group),
+        (expert_params, expert_grads, egtp_group),
+    ):
+        if not grads or group is None or group.size() <= 1:
+            continue
+        coalesced = _flatten_dense_tensors(grads)
+        torch.distributed.all_reduce(coalesced, op=torch.distributed.ReduceOp.SUM, group=group)
+        for param, buf, synced in zip(params, grads, _unflatten_dense_tensors(coalesced, grads)):
+            buf.copy_(synced)
+            grad_attr = _get_main_grad_attr(param)
+            orig_grad = getattr(param, grad_attr)
+            setattr(param, grad_attr, _reshard_if_dtensor(buf, orig_grad))
+
+
 def finalize_model_grads(
     model: List[torch.nn.Module],
     num_tokens: Optional[torch.Tensor] = None,
@@ -495,6 +538,12 @@ def finalize_model_grads(
         pos_emb_group = parallel_state.get_position_embedding_group(check_initialized=False)
         dp_cp_group = parallel_state.get_data_parallel_group(with_context_parallel=True)
 
+    # Fence the current stream against all GTP backward grad work before the DP gradient sync.
+    if config.gtp_weight_remat_size > 1 or config.expert_gtp_weight_remat_size > 1:
+        from megatron.core.tensor_parallel.gtp import wait_for_gtp_grad_reduction_on_current_stream
+
+        wait_for_gtp_grad_reduction_on_current_stream()
+
     # All-reduce / reduce-scatter across DP replicas.
     if config.timers is not None:
         config.timers('all-grads-sync', log_level=1).start(barrier=config.barrier_with_L1_time)
@@ -521,6 +570,8 @@ def finalize_model_grads(
             barrier=config.barrier_with_L1_time
         )
     _allreduce_non_tensor_model_parallel_grads(model, config, tp_group)
+    # Complete the gtp-axis reduction for replicated (non-GTP) params (no-op when GTP inactive).
+    _allreduce_replicated_grads_over_gtp_group(model)
     if config.timers is not None:
         config.timers('non-tensor-parallel-grads-all-reduce').stop()
 
diff --git a/megatron/core/distributed/param_and_grad_buffer.py b/megatron/core/distributed/param_and_grad_buffer.py
index 9051fb9f47e..0d6c7ca1c26 100644
--- a/megatron/core/distributed/param_and_grad_buffer.py
+++ b/megatron/core/distributed/param_and_grad_buffer.py
@@ -1011,6 +1011,7 @@ def __init__(
             param_layout = _compute_default_per_buffer_param_layout(self.params, bucket_size)
         self.param_index_map = param_layout.param_index_map
         self.bucket_indices = param_layout.bucket_indices
+        self.num_optimizer_shards = param_layout.num_optimizer_shards
         per_bucket_numel_unpadded = param_layout.per_bucket_numel_unpadded
 
         # Check if this buffer contains NVFP4 params.
diff --git a/megatron/core/extensions/transformer_engine.py b/megatron/core/extensions/transformer_engine.py
index b7de1013695..50c9d5c655a 100644
--- a/megatron/core/extensions/transformer_engine.py
+++ b/megatron/core/extensions/transformer_engine.py
@@ -19,7 +19,7 @@
 from torch.nn.parameter import Parameter
 from typing_extensions import override
 
-from megatron.core.dist_checkpointing.mapping import ShardedStateDict
+from megatron.core.dist_checkpointing.mapping import ShardedObject, ShardedStateDict
 from megatron.core.dist_checkpointing.utils import replace_prefix_for_sharding
 from megatron.core.enums import Fp4Recipe, Fp8Recipe
 from megatron.core.model_parallel_config import ModelParallelConfig
@@ -381,6 +381,23 @@ def condition_init_method(config, init_method):
     return init_method if config.perform_initialization else (lambda w: None)
 
 
+def _maybe_setup_gtp(module, gtp_group, extra_kwargs):
+    """Wire an active GTP group (size > 1) into TE's extra_kwargs and set module.gtp_size.
+
+    No-op when GTP is inactive (gtp_group is None or size 1), so module.gtp_size stays unset.
+    """
+    if gtp_group is None or gtp_group.size() <= 1:
+        return
+    from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+    assert HAVE_GTP, (
+        "GTP requires TransformerEngine >= 2.17. "
+        "Set MEGATRON_GTP_FORCE_ENABLE=1 to bypass for custom TE builds."
+    )
+    module.gtp_size = get_pg_size(gtp_group)
+    extra_kwargs["gtp_group"] = gtp_group if torch.distributed.is_initialized() else None
+
+
 def split_te_layernorm_column_parallel_linear(
     fused_layer,
     config,
@@ -762,6 +779,7 @@ def __init__(
         symmetric_ar_type: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         """
         Args:
@@ -895,6 +913,7 @@ def __init__(
             self.te_quant_params, torch.is_grad_enabled()
         )
 
+        _maybe_setup_gtp(self, gtp_group, extra_kwargs)
         with init_quant_context:
             super().__init__(
                 in_features=input_size,
@@ -1004,6 +1023,7 @@ def __init__(
         skip_weight_param_allocation: bool = False,
         tp_comm_buffer_name: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
         stride: int = 1,
         name: str | None = None,
     ):
@@ -1101,6 +1121,7 @@ def __init__(
             ), "Must have at least TE version 2.3 or higher to use symmetric memory all reduce"
             extra_kwargs["symmetric_ar_type"] = self.config.symmetric_ar_type
 
+        _maybe_setup_gtp(self, gtp_group, extra_kwargs)
         self.stride = stride
 
         self.te_quant_params: Optional[TEQuantizationParams] = None
@@ -1216,7 +1237,7 @@ def extra_repr(self) -> str:
             f"in_features={self.in_features}, "
             f"out_features={self.out_features}, "
             f"bias={self.use_bias}, "
-            f"TP={self.tp_size}"
+            f"TP={self.tp_size}" + (f", GTP={self.gtp_size}" if hasattr(self, "gtp_size") else "")
         )
 
     def backward_dw(self):
@@ -1243,6 +1264,7 @@ def __init__(
         skip_weight_param_allocation: bool = False,
         tp_comm_buffer_name: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
         stride: int = 1,
         name: str | None = None,
     ):
@@ -1282,6 +1304,7 @@ def __init__(
             symmetric_ar_type=config.symmetric_ar_type,
             tp_group=tp_group,
             name=name,
+            gtp_group=gtp_group,
         )
 
         # Set proper partition_stride
@@ -1332,7 +1355,7 @@ def extra_repr(self) -> str:
             f"in_features={self.in_features}, "
             f"out_features={self.out_features}, "
             f"bias={self.use_bias}, "
-            f"TP={self.tp_size}"
+            f"TP={self.tp_size}" + (f", GTP={self.gtp_size}" if hasattr(self, "gtp_size") else "")
         )
 
     def backward_dw(self):
@@ -1488,6 +1511,7 @@ def __init__(
         tp_comm_buffer_name: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         """
         Args:
@@ -1525,6 +1549,7 @@ def __init__(
             symmetric_ar_type=config.symmetric_ar_type,
             tp_group=tp_group,
             name=name,
+            gtp_group=gtp_group,
         )
         if config.use_cpu_initialization:
             world_size = get_pg_size(tp_group)
@@ -1571,7 +1596,7 @@ def extra_repr(self) -> str:
             f"in_features={self.in_features}, "
             f"out_features={self.out_features}, "
             f"bias={self.use_bias}, "
-            f"TP={self.tp_size}"
+            f"TP={self.tp_size}" + (f", GTP={self.gtp_size}" if hasattr(self, "gtp_size") else "")
         )
 
     def backward_dw(self):
@@ -1981,6 +2006,7 @@ def __init__(
             self._tp_group = tp_group
             tp_size = get_pg_size(tp_group)
             tp_group_for_te = tp_group
+            gtp_group = pg_collection.expt_gtp
 
             self.explicit_expert_comm = is_expert and (tp_size > 1 or self.expert_parallel)
 
@@ -2000,6 +2026,7 @@ def __init__(
                 tp_size = 1
                 tp_group_for_te = None
 
+            _maybe_setup_gtp(self, gtp_group, extra_kwargs)
             if is_te_min_version("2.14.0"):
                 extra_kwargs["single_grouped_weight"] = getattr(
                     config, "moe_single_grouped_weight", False
@@ -2378,7 +2405,14 @@ def get_gemm_tensor(param_name: str, gemm_idx: int) -> torch.Tensor:
                 )
                 if self.use_bias:
                     sharded_state_dict[f"{prefix}bias{gemm_idx}"] = sub_sd[f"{gemm_idx}.bias"]
-            # Adjust replica ids - replication along DP modulo EP
+            # Set the expert-DP replica_id, picking the group by what EGTP does to each entry:
+            #   - weight ShardedTensor: SHARDED across EGTP (distinct chunks) → not replicas →
+            #     use ``intra_expt_dp_no_egtp``.
+            #   - _extra_state ShardedObject: REPLICATED across EGTP → need distinct replica_ids
+            #     to avoid duplicate-writer collisions → use full ``expt_dp``.
+            # EGTP=1: the two groups coincide, so this is a no-op.
+            expt_dp_full = self._pg_collection.expt_dp
+            expt_dp_intra = self._pg_collection.intra_expt_dp_no_egtp
             for k, sh_ten in sharded_state_dict.items():
                 replica_id = sh_ten.replica_id
                 assert (
@@ -2386,8 +2420,10 @@ def get_gemm_tensor(param_name: str, gemm_idx: int) -> torch.Tensor:
                 ), f"Expected replica_id for {k} to be in (PP, TP, DP) format, got: {replica_id}"
                 if getattr(sh_ten, "is_data_parallel_fully_shard", False):
                     edp_replica_id = 0
+                elif isinstance(sh_ten, ShardedObject):
+                    edp_replica_id = get_pg_rank(expt_dp_full)
                 else:
-                    edp_replica_id = get_pg_rank(self._pg_collection.expt_dp)
+                    edp_replica_id = get_pg_rank(expt_dp_intra)
                 sh_ten.replica_id = (*replica_id[:2], edp_replica_id)
             return sharded_state_dict
 
@@ -2399,6 +2435,16 @@ def backward_dw(self):
             if self.delay_wgrad_compute:
                 super().backward_dw()
 
+        def __repr__(self):
+            gtp_str = f", GTP={self.gtp_size}" if hasattr(self, "gtp_size") else ""
+            return (
+                f"{type(self).__name__}(per expert(["
+                f"in={self.in_features}, out={self.out_features}]) "
+                f"X num_gemms={self.num_gemms}, "
+                f"bias={self.use_bias}, TP={self.tp_size}"
+                f"{gtp_str})"
+            )
+
     class TEColumnParallelGroupedLinear(TEGroupedLinear):
         """
         Wrapper for the Transformer-Engine's `GroupedLinear` layer but specialized
diff --git a/megatron/core/model_parallel_config.py b/megatron/core/model_parallel_config.py
index 88bb070e105..157ae1437f5 100644
--- a/megatron/core/model_parallel_config.py
+++ b/megatron/core/model_parallel_config.py
@@ -7,6 +7,41 @@
 import torch
 
 
+def resolve_tensor_parallel_weight_shards(
+    tensor_model_parallel_size: int,
+    tensor_parallel_num_weight_shards: Optional[int],
+    gtp_weight_remat_size: int,
+    shards_field: str = "tensor_parallel_num_weight_shards",
+    tp_field: str = "tensor_model_parallel_size",
+) -> tuple:
+    """Reconcile ``tensor_parallel_num_weight_shards`` and ``gtp_weight_remat_size``.
+
+    ``tensor_parallel_num_weight_shards`` is the user-facing total number of shards each weight is
+    split into across the tensor-parallel + GTP axes. It is the source of truth and implies
+    ``gtp_weight_remat_size = tensor_parallel_num_weight_shards // tensor_model_parallel_size``.
+    When None it defaults to ``tensor_model_parallel_size * gtp_weight_remat_size`` (so the pair
+    stays consistent, and equals ``tensor_model_parallel_size`` in the no-GTP default). Idempotent.
+
+    Returns the reconciled ``(tensor_parallel_num_weight_shards, gtp_weight_remat_size)``.
+    """
+    tp = tensor_model_parallel_size
+    if tensor_parallel_num_weight_shards is None:
+        tensor_parallel_num_weight_shards = tp * gtp_weight_remat_size
+    else:
+        if tensor_parallel_num_weight_shards < tp:
+            raise ValueError(
+                f"{shards_field} ({tensor_parallel_num_weight_shards}) must be "
+                f">= {tp_field} ({tp})."
+            )
+        if tensor_parallel_num_weight_shards % tp != 0:
+            raise ValueError(
+                f"{shards_field} ({tensor_parallel_num_weight_shards}) must be "
+                f"divisible by {tp_field} ({tp})."
+            )
+        gtp_weight_remat_size = tensor_parallel_num_weight_shards // tp
+    return tensor_parallel_num_weight_shards, gtp_weight_remat_size
+
+
 @dataclass
 class ModelParallelConfig:
     """Base configuration for Megatron Core
@@ -20,6 +55,26 @@ class ModelParallelConfig:
     tensor_model_parallel_size: int = 1
     """Intra-layer model parallelism. Splits tensors across GPU ranks."""
 
+    tensor_parallel_num_weight_shards: Optional[int] = None
+    """Total number of shards each weight is split into across the tensor-parallel + GTP axes
+       (i.e. ``tensor_model_parallel_size * gtp_weight_remat_size``). This is the user-facing knob:
+       it must be ``>= tensor_model_parallel_size`` and divisible by it. When None it defaults to
+       ``tensor_model_parallel_size`` (no GTP sharding). It is the source of truth and implies
+       ``gtp_weight_remat_size = tensor_parallel_num_weight_shards // tensor_model_parallel_size``
+       (resolved in ``__post_init__``).
+    """
+
+    gtp_weight_remat_size: int = 1
+    """Generalized tensor parallelism with weight rematerialization. Shards model weights
+       across GPU ranks along ``out_features``; each weight is rematerialized independently
+       (per-weight, not per-layer) via async all-gather on every forward AND backward pass.
+       Placed right after tensor parallelism in the parallelism ordering.
+
+       INTERNAL / DERIVED — there is no CLI flag for it; do not set directly. It is computed in
+       ``__post_init__`` from ``tensor_parallel_num_weight_shards`` (= that value divided by
+       ``tensor_model_parallel_size``). Use ``tensor_parallel_num_weight_shards`` to control GTP.
+    """
+
     pipeline_model_parallel_comm_backend: Optional[Literal["nccl", "ucc"]] = None
     """Configuring backend option of pipeline parallel communication (e.g., nccl, ucc)
        If None, the default backend will be used.
@@ -77,6 +132,27 @@ class ModelParallelConfig:
        Default is None, which will be set to the value of tensor_model_parallel_size.
     """
 
+    expert_tensor_parallel_num_weight_shards: Optional[int] = None
+    """Total number of shards each expert weight is split into across the expert-tensor-parallel +
+       expert-GTP axes (i.e. ``expert_tensor_parallel_size * expert_gtp_weight_remat_size``). This
+       is the user-facing knob for expert layers: it must be ``>= expert_tensor_parallel_size`` and
+       divisible by it. When None it defaults to ``expert_tensor_parallel_size`` (no expert GTP
+       sharding). It is the source of truth and implies
+       ``expert_gtp_weight_remat_size = expert_tensor_parallel_num_weight_shards //
+       expert_tensor_parallel_size`` (resolved in ``__post_init__``).
+    """
+
+    expert_gtp_weight_remat_size: int = 1
+    """Generalized tensor parallelism with weight rematerialization, for expert layers. Independent
+       from the decoder's ``gtp_weight_remat_size``.
+       Placed right after expert parallelism in the parallelism ordering.
+
+       INTERNAL / DERIVED — there is no CLI flag for it; do not set directly. It is computed in
+       ``__post_init__`` from ``expert_tensor_parallel_num_weight_shards`` (= that value divided by
+       ``expert_tensor_parallel_size``). Use ``expert_tensor_parallel_num_weight_shards`` to control
+       expert GTP.
+    """
+
     ###################
     # Initialization
     ###################
@@ -430,6 +506,24 @@ def __post_init__(self):
         if self.expert_tensor_parallel_size is None:
             self.expert_tensor_parallel_size = self.tensor_model_parallel_size
 
+        # Derive the internal gtp_weight_remat_size from the user-facing
+        # tensor_parallel_num_weight_shards:
+        #   num_weight_shards = tensor_model_parallel_size * gtp_weight_remat
+        _, self.gtp_weight_remat_size = resolve_tensor_parallel_weight_shards(
+            self.tensor_model_parallel_size,
+            self.tensor_parallel_num_weight_shards,
+            self.gtp_weight_remat_size,
+        )
+
+        # Same reconciliation for expert layers (expert_tensor_parallel_size finalized above).
+        _, self.expert_gtp_weight_remat_size = resolve_tensor_parallel_weight_shards(
+            self.expert_tensor_parallel_size,
+            self.expert_tensor_parallel_num_weight_shards,
+            self.expert_gtp_weight_remat_size,
+            shards_field="expert_tensor_parallel_num_weight_shards",
+            tp_field="expert_tensor_parallel_size",
+        )
+
         if self.pipeline_model_parallel_size > 1:
             if self.pipeline_dtype is None:
                 raise ValueError(
diff --git a/megatron/core/models/common/embeddings/language_model_embedding.py b/megatron/core/models/common/embeddings/language_model_embedding.py
index 7e49ec6c02d..ff2e86b3a0d 100644
--- a/megatron/core/models/common/embeddings/language_model_embedding.py
+++ b/megatron/core/models/common/embeddings/language_model_embedding.py
@@ -35,6 +35,7 @@ def __init__(
         num_tokentypes: int = 0,
         scatter_to_sequence_parallel: bool = True,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         super().__init__(config=config)
 
@@ -60,6 +61,7 @@ def __init__(
             reduce_scatter_embeddings=self.reduce_scatter_embeddings,
             config=self.config,
             tp_group=self.tp_group,
+            gtp_group=gtp_group,
         )
 
         # Position embedding (serial).
diff --git a/megatron/core/models/hybrid/hybrid_model.py b/megatron/core/models/hybrid/hybrid_model.py
index 1637c9909f1..73be8fc398b 100644
--- a/megatron/core/models/hybrid/hybrid_model.py
+++ b/megatron/core/models/hybrid/hybrid_model.py
@@ -239,6 +239,7 @@ def __init__(
                 position_embedding_type=position_embedding_type,
                 scatter_to_sequence_parallel=scatter_embedding_sequence_parallel,
                 tp_group=self.pg_collection.tp,
+                gtp_group=self.pg_collection.gtp,
             )
 
         # MLA (also used by DeepSeek Sparse Attention) uses its own decoupled RoPE, therefore we do
@@ -322,6 +323,7 @@ def __init__(
                 skip_weight_param_allocation=self.pre_process
                 and self.share_embeddings_and_output_weights,
                 tp_group=self.pg_collection.tp,
+                gtp_group=self.pg_collection.gtp,
             )
 
         if self.pre_process or self.post_process or self.mtp_process:
diff --git a/megatron/core/optimizer/__init__.py b/megatron/core/optimizer/__init__.py
index 27b675d1b8d..e10487a9b34 100644
--- a/megatron/core/optimizer/__init__.py
+++ b/megatron/core/optimizer/__init__.py
@@ -907,13 +907,15 @@ def _get_megatron_emerging_optimizer(
                         "the legacy LayerWise ping-pong path for MoE models."
                     )
                 fallback_config.use_distributed_optimizer = True
+                # Shard optimizer state over the gtp-EXCLUDED replicate group
+                # (intra_dp_cp_no_gtp_group), matching how the DDP grad buffer is partitioned.
                 result = _get_megatron_optimizer_based_on_param_groups(
                     config=fallback_config,
                     model_chunks=model_chunks,
                     param_groups=groups,
                     per_model_buffers=distopt_per_model_buffers,
                     model_parallel_group=distopt_process_groups['mp_group'],
-                    data_parallel_group=distopt_process_groups['intra_dp_cp_group'],
+                    data_parallel_group=distopt_process_groups['intra_dp_cp_no_gtp_group'],
                     data_parallel_group_gloo=distopt_process_groups['intra_dp_cp_group_gloo'],
                     data_parallel_group_idx=get_pg_rank(distopt_process_groups['mp_group']),
                     intra_dist_opt_group=distopt_process_groups['intra_dist_opt_group'],
@@ -1047,13 +1049,33 @@ def get_megatron_optimizer(
 
     dp_cp_group = process_groups_dict['dp_cp_group']
     intra_dp_cp_group = process_groups_dict['intra_dp_cp_group']
-    intra_expt_dp_group = process_groups_dict['intra_expt_dp_group']
+    intra_dp_cp_no_gtp_group = process_groups_dict['intra_dp_cp_no_gtp_group']
+    intra_expt_dp_no_egtp_group = process_groups_dict['intra_expt_dp_no_egtp_group']
     mp_group = process_groups_dict['mp_group']
     expt_tp_pp_group = process_groups_dict['expt_tp_pp_group']
+    expt_tp_pp_with_egtp_group = process_groups_dict['expt_tp_pp_with_egtp_group']
+    expt_dp_group = process_groups_dict['expt_dp_group']
     intra_dp_cp_group_gloo = process_groups_dict['intra_dp_cp_group_gloo']
     intra_expt_dp_group_gloo = process_groups_dict['intra_expt_dp_group_gloo']
     intra_dist_opt_group = process_groups_dict['intra_dist_opt_group']
 
+    # Drives no-Gloo state path + sharding over the *_no_gtp replicate group below.
+    gtp_active = ProcessGroupCollection.is_gtp_active(process_groups_dict)
+    optim_dp_group = intra_dp_cp_no_gtp_group
+    # The gtp-excluded replicate group has no Gloo variant by design (parallel_state asserts it),
+    # so None is correct under GTP. Warn if a Gloo group was requested so the drop is not silent.
+    if gtp_active and intra_dp_cp_group_gloo is not None:
+        log_single_rank(
+            logger,
+            logging.WARNING,
+            "GTP is active: disabling the optimizer's Gloo data-parallel group (no Gloo variant "
+            "of the gtp-excluded replicate group). Use DCP (--ckpt-format torch_dist) for "
+            "checkpointing; the legacy Gloo CPU scatter path is unavailable under GTP.",
+        )
+    optim_dp_group_gloo = None if gtp_active else intra_dp_cp_group_gloo
+    optim_expt_dp_group = intra_expt_dp_no_egtp_group
+
+    # ``mp_group`` spans TP×GTP×PP (GTP-merged).
     model_parallel_rank = get_pg_rank(mp_group)
 
     if get_pg_size(dp_cp_group) > get_pg_size(intra_dp_cp_group):
@@ -1145,7 +1167,7 @@ def get_megatron_optimizer(
                     param_to_param_group[param_name] = param_group_id
                 param_group_id += 1
 
-        # Pass Gloo process groups into optimizer only if needed.
+        # optim_dp_group_gloo was selected above (None when GTP is active; no Gloo path yet).
         optimizers.append(
             _get_megatron_optimizer_based_on_param_groups(
                 config=config,
@@ -1153,8 +1175,8 @@ def get_megatron_optimizer(
                 param_groups=param_groups,
                 per_model_buffers=buffers,
                 model_parallel_group=mp_group,
-                data_parallel_group=intra_dp_cp_group,
-                data_parallel_group_gloo=intra_dp_cp_group_gloo,
+                data_parallel_group=optim_dp_group,
+                data_parallel_group_gloo=optim_dp_group_gloo,
                 data_parallel_group_idx=model_parallel_rank,
                 intra_dist_opt_group=intra_dist_opt_group,
                 distributed_optimizer_instance_id=distributed_optimizer_instance_id,
@@ -1163,6 +1185,9 @@ def get_megatron_optimizer(
         )
         model_chunk_offset += 1
 
+    # Expert params (incl. EGTP shards): reduce over the egtp-EXCLUDED replicate group
+    # (intra_expt_dp_no_egtp_group, which aliases the full expert-DP group when EGTP is
+    # inactive). Backed by expert_parallel_buffers in DDP.
     moe_param_groups, moe_buffers = _get_param_groups_and_buffers(
         model_chunks,
         model_chunk_offset=0,
@@ -1178,9 +1203,15 @@ def get_megatron_optimizer(
                 param_to_param_group[param_name] = param_group_id
             param_group_id += 1
     if len(moe_param_groups) > 0:
-        expt_model_parallel_rank = get_pg_rank(expt_tp_pp_group)
-        # Pass Gloo process groups into optimizer only if needed.
-        if use_gloo_process_groups:
+        # Expert analog of the dense ``model_parallel_rank`` above: the EGTP-merged group gives
+        # each EGTP peer a distinct distopt ShardedObject key. See
+        # docs/api-guide/core/generalized_tensor_parallel.md §3.3 (Optimizer state) for why
+        # the non-merged ``expt_tp_pp_group`` would cause a DCP "duplicate" error.
+        expt_model_parallel_rank = get_pg_rank(expt_tp_pp_with_egtp_group)
+        # Gloo expert-DP group for the optimizer, only when (E)GTP is inactive. When active the
+        # optimizer shards over the egtp-EXCLUDED (no_egtp) replicate group, which has no Gloo
+        # variant yet, so pass None (mirrors the dense optim_dp_group_gloo above).
+        if use_gloo_process_groups and not gtp_active:
             expt_data_parallel_group_gloo = intra_expt_dp_group_gloo
         else:
             expt_data_parallel_group_gloo = None
@@ -1190,8 +1221,8 @@ def get_megatron_optimizer(
                 model_chunks=model_chunks,
                 param_groups=moe_param_groups,
                 per_model_buffers=moe_buffers,
-                model_parallel_group=expt_tp_pp_group,
-                data_parallel_group=intra_expt_dp_group,
+                model_parallel_group=expt_tp_pp_with_egtp_group,
+                data_parallel_group=optim_expt_dp_group,
                 data_parallel_group_gloo=expt_data_parallel_group_gloo,
                 data_parallel_group_idx=expt_model_parallel_rank,
                 intra_dist_opt_group=intra_dist_opt_group,
diff --git a/megatron/core/optimizer/clip_grads.py b/megatron/core/optimizer/clip_grads.py
index 3c5491d39a1..e0347d7047f 100644
--- a/megatron/core/optimizer/clip_grads.py
+++ b/megatron/core/optimizer/clip_grads.py
@@ -47,7 +47,7 @@
         multi_tensor_scale_tensor_impl = None
 
 
-from ..tensor_parallel import param_is_not_tensor_parallel_duplicate
+from ..tensor_parallel import param_is_not_gtp_duplicate, param_is_not_tensor_parallel_duplicate
 from ..transformer.module import param_is_not_shared
 from ..utils import get_data_parallel_group_if_dtensor, to_local_if_dtensor
 
@@ -206,9 +206,9 @@ def count_zeros_fp32(
 
     The count is performed in FP32. This method filters parameters to ensure
     gradients are not double-counted by checking if the gradient is not None,
-    the parameter is not shared, and the parameter is not a replica due
-    to tensor model parallelism. It also handles parameters managed by
-    Megatron FSDP specifically.
+    the parameter is not shared, and the parameter is not a replica due to
+    tensor model parallelism or (expert) generalized tensor parallelism. It also
+    handles parameters managed by Megatron FSDP specifically.
 
     Args:
         parameters (Union[List[torch.Tensor], torch.Tensor]): An iterable of
@@ -218,6 +218,8 @@ def count_zeros_fp32(
         use_decoupled_grad (bool, optional): If True, reads from the
             '.decoupled_grad' attribute instead of the standard '.grad'.
             Defaults to False.
+        tp_group (ProcessGroup, optional): TP group for the TP-duplicate filter.
+            Defaults to the default TP group.
 
     Returns:
         float: The total number of zeros in the gradients across the process group.
@@ -230,6 +232,7 @@ def count_zeros_fp32(
     #   - grad should not be none
     #   - parameter should not be shared
     #   - should not be a replica due to tensor model parallelism
+    #   - should not be a replica due to (expert) generalized tensor parallelism
     total_num_zeros = torch.zeros(1, dtype=torch.int64, device='cuda')
     data_parallel_group = None
     use_megatron_fsdp = False
@@ -246,7 +249,8 @@ def count_zeros_fp32(
             continue
         is_not_shared = param_is_not_shared(param)
         is_not_tp_duplicate = param_is_not_tensor_parallel_duplicate(param, tp_group=tp_group)
-        if grad_not_none and is_not_shared and is_not_tp_duplicate:
+        is_not_gtp_duplicate = param_is_not_gtp_duplicate(param)
+        if grad_not_none and is_not_shared and is_not_tp_duplicate and is_not_gtp_duplicate:
             grad_obj = getattr(param, grad_attr)
             data_parallel_group = get_data_parallel_group_if_dtensor(grad_obj, data_parallel_group)
             grad = to_local_if_dtensor(grad_obj).detach()
diff --git a/megatron/core/optimizer/distrib_optimizer.py b/megatron/core/optimizer/distrib_optimizer.py
index 374b1aab096..a7bfbd14d20 100644
--- a/megatron/core/optimizer/distrib_optimizer.py
+++ b/megatron/core/optimizer/distrib_optimizer.py
@@ -203,7 +203,11 @@ def _build_model_gbuf_range(cls, param_and_grad_buffer: _ParamAndGradBuffer, buc
         """
 
         data_parallel_rank = param_and_grad_buffer.data_parallel_group.rank()
-        data_parallel_world_size = param_and_grad_buffer.data_parallel_group.size()
+        data_parallel_world_size = (
+            param_and_grad_buffer.num_optimizer_shards
+            if param_and_grad_buffer.num_optimizer_shards is not None
+            else param_and_grad_buffer.data_parallel_group.size()
+        )
 
         bucket = param_and_grad_buffer.buckets[bucket_index]
         gbuf_size = bucket.grad_data.numel()
@@ -397,6 +401,7 @@ def _build_model_and_main_param_groups(
                         tensor_parallel.copy_tensor_model_parallel_attributes(
                             shard_model_param, model_param
                         )
+                        tensor_parallel.copy_gtp_attributes(shard_model_param, model_param)
                         copy_optimizer_param_metadata(shard_model_param, model_param)
 
                     # Generate main param.
@@ -428,6 +433,7 @@ def _build_model_and_main_param_groups(
                         tensor_parallel.copy_tensor_model_parallel_attributes(
                             shard_main_param, model_param
                         )
+                        tensor_parallel.copy_gtp_attributes(shard_main_param, model_param)
                         copy_optimizer_param_metadata(shard_main_param, model_param)
                     else:
                         # When using precision-aware optimizer, main params are held by FusedAdam.
@@ -450,6 +456,7 @@ def _build_model_and_main_param_groups(
                     tensor_parallel.copy_tensor_model_parallel_attributes(
                         shard_model_param, model_param
                     )
+                    tensor_parallel.copy_gtp_attributes(shard_model_param, model_param)
                     copy_optimizer_param_metadata(shard_model_param, model_param)
 
                 else:
@@ -563,6 +570,7 @@ def _finalize_bucket(param_end_index, bucket_start_index, bucket_id):
             bucket_indices=bucket_indices,
             per_bucket_numel_unpadded=per_bucket_numel_unpadded,
             param_indices=param_indices if param_indices is not None else [],
+            num_optimizer_shards=data_parallel_world_size,
         )
 
     @staticmethod
diff --git a/megatron/core/optimizer/emerging_optimizers.py b/megatron/core/optimizer/emerging_optimizers.py
index 4dfed6199b3..f775d0accd0 100644
--- a/megatron/core/optimizer/emerging_optimizers.py
+++ b/megatron/core/optimizer/emerging_optimizers.py
@@ -17,7 +17,7 @@
 from torch.optim.optimizer import ParamsT
 
 from megatron.core.process_groups_config import ProcessGroupCollection
-from megatron.core.utils import get_pg_size, log_single_rank
+from megatron.core.utils import get_pg_rank, get_pg_size, log_single_rank
 
 from .optimizer_config import ParamKey, ParamPredicate
 
@@ -230,6 +230,35 @@ def scaled_orthogonalize_fn(
             scaled_orthogonalize_fn=scaled_orthogonalize_fn,
         )
 
+    def scaled_orthogonalize_fn_with_gtp(self, p, grad, tp_group, partition_dim):
+        """All-gather grad along GTP/EGTP dim 0, orthogonalize, then slice back.
+
+        GTP shards weights along dim 0 independently of TP's partition_dim. Newton-Schulz
+        needs the full weight matrix, so we reconstruct the GTP dimension before running
+        the TP-aware orthogonalization, then extract the local GTP shard from the result.
+        When GTP is inactive this is a plain passthrough to scaled_orthogonalize_fn.
+        """
+        is_expert = getattr(p, 'expert_tp', False)
+        gtp_group = (
+            (self.pg_collection.expt_gtp if is_expert else self.pg_collection.gtp)
+            if self.pg_collection
+            else None
+        )
+
+        if gtp_group is None or get_pg_size(gtp_group) <= 1:
+            return self.scaled_orthogonalize_fn(grad, tp_group, partition_dim)
+
+        gtp_size = get_pg_size(gtp_group)
+        gtp_rank = get_pg_rank(gtp_group)
+        shards = [torch.empty_like(grad) for _ in range(gtp_size)]
+        torch.distributed.all_gather(shards, grad, gtp_group)
+        gathered_grad = torch.cat(shards, dim=0)
+
+        gathered_grad = self.scaled_orthogonalize_fn(gathered_grad, tp_group, partition_dim)
+
+        shard_size = gathered_grad.shape[0] // gtp_size
+        return gathered_grad[gtp_rank * shard_size : (gtp_rank + 1) * shard_size].contiguous()
+
     def orthogonalize(self, p: torch.Tensor, grad: torch.Tensor, **kwargs: Any) -> torch.Tensor:
         """Orthogonalize the momentum.
 
@@ -280,14 +309,14 @@ def orthogonalize(self, p: torch.Tensor, grad: torch.Tensor, **kwargs: Any) -> t
             qkv_grads = [g.reshape(-1, grad_shape[-1]) for g in qkv_grads]
 
             qkv_grads = [
-                self.scaled_orthogonalize_fn(g, tp_group, partition_dim).view(
+                self.scaled_orthogonalize_fn_with_gtp(p, g, tp_group, partition_dim).view(
                     num_query_groups, -1, grad_shape[-1]
                 )
                 for g in qkv_grads
             ]
             grad = torch.cat(qkv_grads, dim=1).view(grad_shape)
         else:
-            grad = self.scaled_orthogonalize_fn(grad, tp_group, partition_dim)
+            grad = self.scaled_orthogonalize_fn_with_gtp(p, grad, tp_group, partition_dim)
         return grad
 
 
diff --git a/megatron/core/optimizer/layer_wise_optimizer.py b/megatron/core/optimizer/layer_wise_optimizer.py
index 606525f8097..48e73a5a3f0 100644
--- a/megatron/core/optimizer/layer_wise_optimizer.py
+++ b/megatron/core/optimizer/layer_wise_optimizer.py
@@ -2,6 +2,7 @@
 
 import logging
 import math
+import re
 from typing import Callable, Dict, List, Optional, Tuple
 
 import torch
@@ -86,6 +87,84 @@ def tag_params_for_buffer_routing(model_chunks) -> None:
             param.is_managed_by_layer_wise_optimizer = is_managed_by_layer_wise_optimizer(param)
 
 
+def _build_gtp_replica_fold(pg_collection, model_chunks) -> Dict[str, Tuple[int, int]]:
+    """Map each (E)GTP-REPLICATED param's name to ``(gtp_rank, gtp_size)`` for replica_id folding.
+
+    PROBLEM: LayerWise keeps (E)GTP-replicated params (identical on every (e)gtp peer) WHOLE, so
+    their optimizer-state ShardedTensors share one key+offset across those peers. The DP-coord reset
+    in ``sharded_state_dict`` would then mark all peers the all-zero "main replica" -> DCP sees N
+    writers for one shard and rejects the save.
+
+    FIX: fold the (e)gtp rank into ``replica_id[1]`` so exactly one peer writes. (E)GTP-SHARDED
+    params (``GTPShardedParam``) are offset-sharded and excluded -- each shard already has a
+    distinct offset, hence a unique writer.
+
+    Returns: ``{param_name: (gtp_rank, gtp_size)}``, empty (no folding) when GTP is unavailable or
+    no group spans >1 rank. Names are bare (all ``module.`` wrappers stripped, layer index
+    collapsed) to match the optimizer-state checkpoint key suffix.
+    """
+    gtp_fold: Dict[str, Tuple[int, int]] = {}
+    try:
+        from megatron.core.tensor_parallel.gtp import HAVE_GTP, GTPShardedParam
+    except ImportError:
+        return gtp_fold
+    if not HAVE_GTP:
+        return gtp_fold
+
+    from megatron.core import parallel_state
+
+    # Source the (e)gtp groups from pg_collection if populated, else from parallel_state
+    # (the default pg_collection leaves gtp/expt_gtp unset). Compatibility point.
+    gtp_group = getattr(pg_collection, 'gtp', None) if pg_collection else None
+    if gtp_group is None:
+        gtp_group = parallel_state.get_gtp_weight_remat_group(check_initialized=False)
+    egtp_group = getattr(pg_collection, 'expt_gtp', None) if pg_collection else None
+    if egtp_group is None:
+        egtp_group = parallel_state.get_expert_gtp_weight_remat_group(check_initialized=False)
+
+    for model_chunk in model_chunks:
+        for name, p in model_chunk.named_parameters():
+            if isinstance(p, GTPShardedParam):
+                continue
+            grp = egtp_group if getattr(p, 'is_expert_parallel', False) else gtp_group
+            if grp is None or grp.size() <= 1:
+                continue
+            # Normalize the param name so it matches the optimizer-state checkpoint key suffix,
+            # which is wrapper-free and layer-collapsed. Two transforms, in order:
+            #   1. drop every leading 'module.' (DDP + Float16Module can double-wrap the model), and
+            #   2. collapse the layer index (the checkpoint key drops it -- it is a sharded axis).
+            # e.g.  'module.module.decoder.layers.3.mlp.router.weight'
+            #         -> 'decoder.layers.mlp.router.weight'
+            nm = name
+            while nm.startswith('module.'):
+                nm = nm[len('module.') :]
+            nm = re.sub(r'\.layers\.\d+\.', '.layers.', nm)
+            gtp_fold[nm] = (grp.rank(), grp.size())
+    return gtp_fold
+
+
+def _fold_replica_id(replica_id, key, gtp_fold: Dict[str, Tuple[int, int]]):
+    """Compute a ShardedTensor's writer-disambiguating replica_id for fixed-DP checkpointing.
+
+    Base reset: keep (PP, TP), zero DP -- every DP rank holds the same shard, so one writer
+    remains. Correct for normal params.
+
+    For an (e)gtp-replicated param (one in ``gtp_fold``), the reset leaves ``gtp_size`` writers, so
+    fold the peer's gtp rank into the TP slot to re-spread them: ``new_tp = old_tp * gtp_size +
+    gtp_rank`` (rank 0 stays the writer, the others move off the all-zero main replica) -> one
+    writer per shard. Suffix-match (bare fold name vs fully-qualified key) and collapse the key's
+    layer index too, so it matches per-layer and already-collapsed keys.
+    """
+    rid = (*replica_id[:2], 0)
+    if not gtp_fold:
+        return rid
+    key = re.sub(r'\.layers\.\d+\.', '.layers.', key or '')
+    for nm, (gtp_rank, gtp_size) in gtp_fold.items():
+        if key.endswith(nm):
+            return (rid[0], rid[1] * gtp_size + gtp_rank, rid[2])
+    return rid
+
+
 class LayerWiseDistributedOptimizer(ChainedOptimizer):
     """Layer-wise distributed optimizer for Megatron-core models.
 
@@ -288,6 +367,7 @@ def _emit_bucket(
             bucket_indices=bucket_indices,
             per_bucket_numel_unpadded=per_bucket_numel_unpadded,
             param_indices=param_indices if param_indices is not None else [],
+            num_optimizer_shards=dp_size,
         )
 
     @staticmethod
@@ -366,6 +446,16 @@ def __init__(
 
         self.pg_collection = pg_collection
 
+        # Use GTP/EGTP-excluded DP groups for layer-wise sharding and all-gather so that
+        # only true weight replicas are sharded across; GTP's own all-gather / reduce-scatter
+        # handles the GTP axis separately. Falls back to the full groups when GTP is inactive.
+        if pg_collection is not None:
+            self.dp_cp_group = pg_collection.dp_cp_no_gtp or pg_collection.dp_cp
+            self.expt_dp_group = pg_collection.expt_dp_no_egtp or pg_collection.expt_dp
+        else:
+            self.dp_cp_group = None
+            self.expt_dp_group = None
+
         full_param_layouts = None
         if model_chunks is not None:
             full_param_layouts = [
@@ -448,13 +538,13 @@ def shard_params(self, optimizers, full_param_layouts=None):
                 chunk).  ``None`` triggers the legacy fallback.
         """
         # Simplify when dp_cp group size is 1.
-        dp_cp_size = get_pg_size(self.pg_collection.dp_cp)
+        dp_cp_size = get_pg_size(self.dp_cp_group)
         if dp_cp_size == 1:
             self.dp_cp_params_list = None
             self.expt_dp_params_list = None
             return
 
-        expt_dp_size = get_pg_size(self.pg_collection.expt_dp)
+        expt_dp_size = get_pg_size(self.expt_dp_group)
 
         if full_param_layouts is not None:
             self._shard_params_from_layout(optimizers, full_param_layouts, dp_cp_size, expt_dp_size)
@@ -463,8 +553,8 @@ def shard_params(self, optimizers, full_param_layouts=None):
 
     def _shard_params_from_layout(self, optimizers, full_param_layouts, dp_cp_size, expt_dp_size):
         """Derive shard assignments from the param layout."""
-        dp_cp_rank = get_pg_rank(self.pg_collection.dp_cp)
-        expt_dp_rank = get_pg_rank(self.pg_collection.expt_dp)
+        dp_cp_rank = get_pg_rank(self.dp_cp_group)
+        expt_dp_rank = get_pg_rank(self.expt_dp_group)
 
         self.dp_cp_params_list = [[] for _ in range(dp_cp_size)]
         self.expt_dp_params_list = [[] for _ in range(expt_dp_size)]
@@ -478,14 +568,15 @@ def _shard_params_from_layout(self, optimizers, full_param_layouts, dp_cp_size,
                 # separate DistributedOptimizer; LayerWise does not own them.
                 if not buffer_key.is_managed_by_layer_wise_optimizer:
                     continue
-                dp_size = expt_dp_size if buffer_key.is_expert_parallel else dp_cp_size
                 for param, (
                     param_start_index,
                     param_end_index,
                     bucket_id,
                 ) in layout.param_index_map.items():
                     bucket_start_index, bucket_end_index = layout.bucket_indices[bucket_id]
-                    shard_size = (bucket_end_index - bucket_start_index) // dp_size
+                    shard_size = (
+                        bucket_end_index - bucket_start_index
+                    ) // layout.num_optimizer_shards
                     shard_id = (param_start_index - bucket_start_index) // shard_size
                     shard_end_index = bucket_start_index + (shard_id + 1) * shard_size
                     assert param_end_index <= shard_end_index, (
@@ -561,12 +652,12 @@ def _shard_params_ping_pong(self, optimizers, dp_cp_size, expt_dp_size):
         # Assign params to rank in ping-pong style loop.
         for p, group_index in param_list:
             if param_groups[group_index].get("is_expert_parallel", False):
-                if expt_dp_loop[expt_dp_idx] == get_pg_rank(self.pg_collection.expt_dp):
+                if expt_dp_loop[expt_dp_idx] == get_pg_rank(self.expt_dp_group):
                     param_groups_this_rank[group_index].append(p)
                 self.expt_dp_params_list[expt_dp_loop[expt_dp_idx]].append(p)
                 expt_dp_idx = (expt_dp_idx + 1) % len(expt_dp_loop)
             else:
-                if dp_cp_loop[dp_cp_idx] == get_pg_rank(self.pg_collection.dp_cp):
+                if dp_cp_loop[dp_cp_idx] == get_pg_rank(self.dp_cp_group):
                     param_groups_this_rank[group_index].append(p)
                 self.dp_cp_params_list[dp_cp_loop[dp_cp_idx]].append(p)
                 dp_cp_idx = (dp_cp_idx + 1) % len(dp_cp_loop)
@@ -599,7 +690,7 @@ def set_bucket_layerwise_params_list(self, model_chunks):
                 for bucket in group.buckets:
                     if not _bucket_is_managed_by_layer_wise_optimizer(bucket):
                         continue
-                    bucket_params_list = [[] for _ in range(get_pg_size(self.pg_collection.dp_cp))]
+                    bucket_params_list = [[] for _ in range(get_pg_size(self.dp_cp_group))]
                     for bucket_list, full_params_list in zip(
                         bucket_params_list, self.dp_cp_params_list
                     ):
@@ -613,9 +704,7 @@ def set_bucket_layerwise_params_list(self, model_chunks):
                     if not _bucket_is_managed_by_layer_wise_optimizer(bucket):
                         continue
                     if self.expt_dp_params_list is not None:
-                        bucket_params_list = [
-                            [] for _ in range(get_pg_size(self.pg_collection.expt_dp))
-                        ]
+                        bucket_params_list = [[] for _ in range(get_pg_size(self.expt_dp_group))]
                         for bucket_list, full_params_list in zip(
                             bucket_params_list, self.expt_dp_params_list
                         ):
@@ -678,9 +767,9 @@ def _allgather_helper(params_list, group):
         if self.pg_collection is None:
             return
         if self.dp_cp_params_list:
-            _allgather_helper(self.dp_cp_params_list, self.pg_collection.dp_cp)
+            _allgather_helper(self.dp_cp_params_list, self.dp_cp_group)
         if self.expt_dp_params_list:
-            _allgather_helper(self.expt_dp_params_list, self.pg_collection.expt_dp)
+            _allgather_helper(self.expt_dp_params_list, self.expt_dp_group)
 
     @torch.no_grad()
     def broadcast_params(self):
@@ -689,15 +778,15 @@ def broadcast_params(self):
         if self.dp_cp_params_list is None:
             return
         for i, params in enumerate(self.dp_cp_params_list):
-            src_global_rank = torch.distributed.get_global_rank(self.pg_collection.dp_cp, i)
+            src_global_rank = torch.distributed.get_global_rank(self.dp_cp_group, i)
             for p in params:
-                torch.distributed.broadcast(p, src_global_rank, self.pg_collection.dp_cp)
+                torch.distributed.broadcast(p, src_global_rank, self.dp_cp_group)
         if self.expt_dp_params_list is None:
             return
         for i, params in enumerate(self.expt_dp_params_list):
-            src_global_rank = torch.distributed.get_global_rank(self.pg_collection.expt_dp, i)
+            src_global_rank = torch.distributed.get_global_rank(self.expt_dp_group, i)
             for p in params:
-                torch.distributed.broadcast(p, src_global_rank, self.pg_collection.expt_dp)
+                torch.distributed.broadcast(p, src_global_rank, self.expt_dp_group)
 
     @torch.no_grad()
     def get_grad_norm(self):
@@ -830,14 +919,20 @@ def sharded_state_dict(
             model_sharded_state_dict, is_loading, **kwargs
         )
 
+        # (E)GTP-replicated-param -> (gtp_rank, gtp_size), consumed by _fold_replica_id below.
+        gtp_fold = _build_gtp_replica_fold(self.pg_collection, self.model_chunks)
+
         # for fixed DP usage only
         for sh_base in nested_values(sharded_state_dict):
             if hasattr(sh_base, 'replica_id'):
                 assert (
                     isinstance(sh_base.replica_id, int) or len(sh_base.replica_id) == 3
                 ), f'Expected replica_id as int or (PP, TP, DP), got: {sh_base}'
-                sh_base.replica_id = (
-                    0 if isinstance(sh_base.replica_id, int) else (*sh_base.replica_id[:2], 0)
+                if isinstance(sh_base.replica_id, int):
+                    sh_base.replica_id = 0
+                    continue
+                sh_base.replica_id = _fold_replica_id(
+                    sh_base.replica_id, getattr(sh_base, 'key', ''), gtp_fold
                 )
 
         # later code assume list but chained optimizer fallback to non-list if there's only one
diff --git a/megatron/core/optimizer/optimizer.py b/megatron/core/optimizer/optimizer.py
index e03992e0657..880e7a71ce8 100644
--- a/megatron/core/optimizer/optimizer.py
+++ b/megatron/core/optimizer/optimizer.py
@@ -183,6 +183,7 @@ def _filter_grads_for_norm(
           - parameter should not be shared (i.e., grads shouldn't be double counted while
             computing norms).
           - should not be a replica due to tensor model parallelism.
+          - should not be a replica due to (expert) generalized tensor parallelism.
         """
         grads_for_norm = []
         for param in params:
@@ -211,7 +212,9 @@ def _filter_grads_for_norm(
             is_not_tp_duplicate = tensor_parallel.param_is_not_tensor_parallel_duplicate(
                 param, getattr(self, 'tp_group', None)
             )
-            if grad_not_none and is_not_shared and is_not_tp_duplicate:
+            is_not_gtp_duplicate = tensor_parallel.param_is_not_gtp_duplicate(param)
+
+            if grad_not_none and is_not_shared and is_not_tp_duplicate and is_not_gtp_duplicate:
                 grads_for_norm.append(grad)
         return grads_for_norm
 
@@ -772,6 +775,56 @@ def step(self):
         return success, grad_norm, num_zeros_in_grad
 
 
+def _backfill_gtp_sharded_param_map(id_to_sharded_param_map: dict, float16_groups) -> None:
+    """Backfill the optimizer id->ShardedTensor map with GTP shards it is missing (in place).
+
+    WHAT: ``get_param_id_to_sharded_param_map`` matches an optimizer param to its model
+    ShardedTensor by object identity (``id(model_entry.data) == id(optim_param)``). A GTP weight
+    whose model entry is a gathered+split factory (Mamba ``in_proj``) exposes the *gathered* tensor,
+    not the per-shard ``GTPShardedParam``, so it fails to match and is absent from the map -- the
+    generic conversion below would then KeyError on it. This backfills the same per-shard
+    ShardedTensor every other GTP weight already gets, so its optimizer state is saved per-shard.
+
+    WHEN: only the distributed-Muon path reaches here. ``LayerWiseDistributedOptimizer`` keeps such
+    matrix params whole and routes them through this ``Float16OptimizerWithFloat16Params``.
+    Distributed Adam uses its own ``DistributedOptimizer.sharded_state_dict`` (flat-buffer path)
+    and is unaffected.
+
+    No-op when GTP is unavailable or when every param already matched.
+    """
+    try:
+        from megatron.core import parallel_state
+        from megatron.core.tensor_parallel.gtp import (
+            GTPShardedParam,
+            make_sharded_tensors_for_checkpoint_with_gtp,
+        )
+    except ImportError:
+        return  # GTP not built in -- nothing to backfill.
+
+    # Groups sourced lazily (below) only when a GTP param is found, so GTP-free models on
+    # explicit grids (e.g. MiMo) never require the global MPU groups to be initialized.
+    tp_group = None
+    dp_cp_group = None
+    for param_id, p in enumerate(chain.from_iterable(float16_groups)):
+        # Skip params that already matched, and any non-GTP param (those always match).
+        if param_id in id_to_sharded_param_map or not isinstance(p, GTPShardedParam):
+            continue
+        if tp_group is None:
+            tp_group = parallel_state.get_tensor_model_parallel_group()
+            dp_cp_group = parallel_state.get_data_parallel_group(with_context_parallel=True)
+        # Key by the param's dotted name (set in prod by tag_gtp_params_with_names); the fallback
+        # keeps the function usable in tests where the name was not tagged.
+        key = p._debug_name or f'_gtp_optim_param_{param_id}'
+        entry = make_sharded_tensors_for_checkpoint_with_gtp(
+            {key: p},
+            prefix='',
+            tensor_parallel_layers_axis_map={key: 0},
+            tp_group=tp_group,
+            dp_cp_group=dp_cp_group,
+        )
+        id_to_sharded_param_map[param_id] = entry[key]
+
+
 class Float16OptimizerWithFloat16Params(MixedPrecisionOptimizer):
     """Float16 optimizer for fp16 and bf16 data types.
 
@@ -823,6 +876,7 @@ def __init__(
                             main_param = param.detach().clone().float()
                             # Copy tensor model parallel attributes.
                             tensor_parallel.copy_tensor_model_parallel_attributes(main_param, param)
+                            tensor_parallel.copy_gtp_attributes(main_param, param)
                             copy_optimizer_param_metadata(main_param, param)
                             # Replace the optimizer params with the new fp32 copy.
                             param_group['params'][i] = main_param
@@ -961,6 +1015,8 @@ def sharded_state_dict(
             model_sharded_state_dict, chain.from_iterable(g for g in self.float16_groups)
         )
 
+        _backfill_gtp_sharded_param_map(id_to_sharded_param_map, self.float16_groups)
+
         # Convert fp32_from_fp16_params
         assert len(state_dict['fp32_from_fp16_params']) == len(
             state_dict['optimizer']['param_groups']
diff --git a/megatron/core/optimizer/param_layout.py b/megatron/core/optimizer/param_layout.py
index 2ee511c6126..9d2dd4db365 100644
--- a/megatron/core/optimizer/param_layout.py
+++ b/megatron/core/optimizer/param_layout.py
@@ -11,7 +11,7 @@
 
 import math
 from dataclasses import dataclass, field
-from typing import Dict, List, Tuple
+from typing import Dict, List, Optional, Tuple
 
 import torch
 
@@ -79,12 +79,16 @@ class PerBufferParamLayout:
         param_indices: The index of each param among same-dtype params (using the "fake"
             high-precision dtype for FP8/NVFP4 params). Needed for loading non-native-fp8
             checkpoints in native-fp8 mode. Order matches param_index_map iteration order.
+        num_optimizer_shards: Number of optimizer shards. Set by the distributed optimizer
+            that computes the layout so that shard assignment at runtime uses the same
+            value. ``None`` for non-distributed-optimizer layouts.
     """
 
     param_index_map: Dict[torch.nn.Parameter, Tuple[int, int, int]] = field(default_factory=dict)
     bucket_indices: List[Tuple[int, int]] = field(default_factory=list)
     per_bucket_numel_unpadded: List[int] = field(default_factory=list)
     param_indices: List[int] = field(default_factory=list)
+    num_optimizer_shards: Optional[int] = None
 
 
 @dataclass
diff --git a/megatron/core/parallel_state.py b/megatron/core/parallel_state.py
index 337485b4d12..8509ec66f41 100644
--- a/megatron/core/parallel_state.py
+++ b/megatron/core/parallel_state.py
@@ -27,6 +27,9 @@
 
 # Intra-layer model parallel group that the current rank belongs to.
 _TENSOR_MODEL_PARALLEL_GROUP = None
+# Generalized tensor parallelism group that the current rank belongs to.
+_GTP_WEIGHT_REMAT_GROUP = None
+_GTP_WEIGHT_REMAT_GLOBAL_RANKS = None
 # Inter-layer model parallel group that the current rank belongs to.
 _PIPELINE_MODEL_PARALLEL_GROUP = None
 # Model parallel group (both intra- and pipeline) that the current rank belongs to.
@@ -50,6 +53,9 @@
 # _EXPERT_TENSOR denotes tensor parallelism of expert which splits tensor across the group.
 # _EXPERT_DATA denotes data parallelism of expert which replicates weight across the group.
 
+# Expert generalized tensor parallelism group that current rank belongs to.
+_EXPERT_GTP_WEIGHT_REMAT_GROUP = None
+_EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS = None
 # Expert model parallel group that current rank belongs to.
 _EXPERT_MODEL_PARALLEL_GROUP = None
 # Expert tensor parallel group that current rank belongs to.
@@ -58,12 +64,20 @@
 _EXPERT_TENSOR_AND_MODEL_PARALLEL_GROUP = None
 # Expert tensor, model, pipeline combined parallel group
 _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP = None
+# Same as above, but additionally merged across EGTP peers (analog of dense _MODEL_PARALLEL_GROUP
+# under GTP). Identical to the above when EGTP=1.
+_EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP = None
 # Expert data parallel group
 _EXPERT_DATA_PARALLEL_GROUP = None
+_EXPERT_DATA_PARALLEL_GROUP_NO_GTP = None
 _EXPERT_DATA_PARALLEL_GROUP_GLOO = None
 _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP = None
 _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_GLOO = None
 _INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP = None
+# Partial expert DP group with EGTP peers excluded — per-distopt-instance slice
+# of true expert-weight replicas. Mirrors _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP
+# on the dense side.
+_INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP = None
 # Parallel state values changed on the fly
 _MPU_EXPERT_MODEL_PARALLEL_WORLD_SIZE = None
 _MPU_EXPERT_MODEL_PARALLEL_RANK = None
@@ -118,6 +132,10 @@
 # Hybrid context parallel groups
 _HYBRID_DP_CP_GROUPS = {}
 
+# Data parallel group information with generalized tensor parallel accounted for.
+_DATA_PARALLEL_GROUP_NO_GTP = None
+_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = None
+
 # Data parallel group information with context parallel combined.
 _DATA_PARALLEL_GROUP_WITH_CP = None
 _DATA_PARALLEL_GROUP_WITH_CP_GLOO = None
@@ -127,6 +145,10 @@
 _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP = None
 _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_GLOO = None
 
+# Partial Data parallel group information with context parallel combined and GTP peers
+# excluded. Reaches only true weight-replica ranks within one distributed-optimizer instance.
+_INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = None
+
 # combined parallel group of TP and CP
 _TENSOR_AND_CONTEXT_PARALLEL_GROUP = None
 
@@ -447,7 +469,15 @@ class RankGenerator(object):
     """A class for generating rank groups for different modes of parallelism."""
 
     def __init__(
-        self, tp: int, ep: int, dp: int, pp: int, cp: int, order: str, rank_offset: int = 0
+        self,
+        tp: int,
+        ep: int,
+        dp: int,
+        pp: int,
+        cp: int,
+        order: str,
+        rank_offset: int = 0,
+        gtp: int = 1,
     ) -> None:
         assert (
             ep == 1 or cp == 1
@@ -459,8 +489,11 @@ def __init__(
         self.dp = dp
         self.pp = pp
         self.cp = cp
+        # gtp is a genuine world_size factor; gtp=1 (default) is a size-1 identity dim,
+        # leaving world_size and all rank groups unchanged for non-GTP callers.
+        self.gtp = gtp
         self.rank_offset = rank_offset
-        self.world_size = tp * dp * pp * cp * ep
+        self.world_size = tp * dp * pp * cp * ep * gtp
 
         self.name_to_size = {
             "tp": self.tp,
@@ -468,6 +501,7 @@ def __init__(
             "dp": self.dp,
             "ep": self.ep,
             "cp": self.cp,
+            "gtp": self.gtp,
         }
         self.order = order
         order = order.lower()
@@ -520,6 +554,13 @@ def get_ranks(self, token):
                     rank_group[i] += self.rank_offset
         return ranks
 
+    def get_gtp_ranks(self, gtp_size: int):
+        """Get the GTP weight-sharding groups (singletons when ``gtp_size == 1``)."""
+        assert (
+            self.gtp == gtp_size
+        ), f"gtp axis size ({self.gtp}) != requested gtp_size ({gtp_size})"
+        return self.get_ranks('gtp')
+
 
 def default_embedding_ranks(pp_ranks):
     """Return the default ranks that constitute the stages on which the word embeddings live.
@@ -554,6 +595,8 @@ def initialize_model_parallel(
     hierarchical_context_parallel_sizes: Optional[List[int]] = None,
     hybrid_context_parallel: bool = False,
     expert_model_parallel_size: int = 1,
+    gtp_remat_size: int = 1,
+    expert_gtp_remat_size: int = 1,
     num_distributed_optimizer_instances: int = 1,
     expert_tensor_parallel_size: Optional[int] = None,
     nccl_communicator_config_path: Optional[str] = None,
@@ -633,6 +676,22 @@ def initialize_model_parallel(
             The number of Mixture of Experts parallel GPUs in each expert
             parallel group.
 
+        gtp_remat_size (int, default = 1):
+            Generalized tensor parallelism with weight rematerialization (GTP).
+            Shards model weights along ``out_features`` across this many ranks;
+            each weight is rematerialized independently (per-weight, not per-
+            layer) via async all-gather on every forward AND backward pass. A
+            first-class orthogonal axis (world_size = TP*GTP*CP*DP). Maps to the
+            dataclass field ``ModelParallelConfig.gtp_weight_remat_size``.
+            NOTE: "remat" here is NOT activation recomputation/checkpointing.
+
+        expert_gtp_remat_size (int, default = 1):
+            Expert-side counterpart of ``gtp_remat_size`` — shards routed-expert
+            weights along ``out_features`` and rematerializes per-weight on
+            every forward AND backward pass. A first-class orthogonal axis on the
+            expert grid. Independent from ``gtp_remat_size``. Maps to
+            ``ModelParallelConfig.expert_gtp_weight_remat_size``.
+
         num_distributed_optimizer_instances (int, default = 1):
             The number of distributed optimizer replicas across the data-
             parallel domain.
@@ -730,7 +789,22 @@ def initialize_model_parallel(
         local_world_size if local_world_size is not None else torch.distributed.get_world_size()
     )
 
-    model_size = tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size
+    # GTP requires a single distributed-optimizer instance: partial-distopt sharding of the
+    # data domain would need gtp-aware sizing. Assert early so all group builds below can
+    # assume one instance when GTP/EGTP is active.
+    assert not (
+        (gtp_remat_size > 1 or expert_gtp_remat_size > 1)
+        and num_distributed_optimizer_instances > 1
+    ), "GTP with num_distributed_optimizer_instances > 1 is not yet supported."
+
+    # gtp counts toward model_size (it consumes its own ranks and carries distinct data),
+    # so data_parallel_size becomes the gtp-EXCLUDED replicate degree.
+    model_size = (
+        tensor_model_parallel_size
+        * pipeline_model_parallel_size
+        * context_parallel_size
+        * gtp_remat_size
+    )
 
     if world_size % model_size != 0:
         raise RuntimeError(f"world_size ({world_size}) is not divisible by {model_size}")
@@ -767,21 +841,43 @@ def initialize_model_parallel(
     for pg_name in high_priority_stream_groups:
         overwrite_nccl_comm_cfgs(nccl_comm_cfgs, pg_name, ("is_high_priority_stream", True))
 
+    # GTP is a real RankGenerator axis: inject 'gtp' into the order. Position controls NCCL
+    # locality (leftmost token = smallest stride = most adjacent ranks):
+    #   - dense/decoder: inject after 'tp' → 'tp-gtp-cp-ep-dp-pp' (GTP gets local placement).
+    #   - expert: inject after 'ep' → 'tp-cp-ep-gtp-dp-pp' so EP keeps the more-local placement
+    #     than EGTP (the MoE EP all-to-all is the heavier expert-side collective).
+    # When gtp/egtp size is 1 the injected axis is a no-op (singleton groups).
+    def _inject_gtp(order_str: str, after: str = "tp") -> str:
+        toks = order_str.split("-")
+        if "gtp" in toks:
+            return order_str
+        anchor = after if after in toks else "tp"
+        pos = (toks.index(anchor) + 1) if anchor in toks else 0
+        toks.insert(pos, "gtp")
+        return "-".join(toks)
+
+    decoder_order = _inject_gtp(order, after="tp")
+
     decoder_rank_generator = RankGenerator(
         tp=tensor_model_parallel_size,
         ep=1,
         dp=data_parallel_size,
         pp=pipeline_model_parallel_size,
         cp=context_parallel_size,
-        order=order,
+        order=decoder_order,
         rank_offset=rank_offset,
+        gtp=gtp_remat_size,
     )
 
     # Build expert rank generator
     if expert_tensor_parallel_size is None:
         expert_tensor_parallel_size = tensor_model_parallel_size
+    # EGTP is a world-size factor for the expert grid too (mirrors gtp on the dense grid).
     expert_tensor_model_pipeline_parallel_size = (
-        expert_tensor_parallel_size * expert_model_parallel_size * pipeline_model_parallel_size
+        expert_tensor_parallel_size
+        * expert_model_parallel_size
+        * pipeline_model_parallel_size
+        * expert_gtp_remat_size
     )
     expert_data_parallel_size = world_size // expert_tensor_model_pipeline_parallel_size
     if world_size % expert_tensor_model_pipeline_parallel_size != 0:
@@ -789,15 +885,18 @@ def initialize_model_parallel(
             f"world_size ({world_size}) is not divisible by expert_tensor_model_pipeline_parallel size ({expert_tensor_model_pipeline_parallel_size})"
         )
 
-    # TODO: support expert specific ordering
+    # Expert grid: inject gtp AFTER 'ep' so EP outranks EGTP for NCCL locality (heavy MoE
+    # all-to-all stays on the more-adjacent ranks; EGTP AG/RS takes the outer placement).
+    expert_order = _inject_gtp(order, after="ep")
     expert_decoder_rank_generator = RankGenerator(
         tp=expert_tensor_parallel_size,
         ep=expert_model_parallel_size,
         dp=expert_data_parallel_size,
         pp=pipeline_model_parallel_size,
         cp=1,
-        order=order,
+        order=expert_order,
         rank_offset=rank_offset,
+        gtp=expert_gtp_remat_size,
     )
 
     assert (
@@ -833,6 +932,31 @@ def initialize_model_parallel(
         data_parallel_size * context_parallel_size
     ) // num_distributed_optimizer_instances
 
+    # Build the generalized tensor parallel groups.
+    # GTP overlaps with the CP-DP domain because GTP only shards weights
+    # while CP only shards activations — they are independent and can share ranks.
+    global _GTP_WEIGHT_REMAT_GROUP
+    global _GTP_WEIGHT_REMAT_GLOBAL_RANKS
+    assert (
+        _GTP_WEIGHT_REMAT_GROUP is None
+    ), "generalized tensor parallel group is already initialized"
+    for gtp_ranks in decoder_rank_generator.get_gtp_ranks(gtp_remat_size):
+        group = create_group(
+            gtp_ranks,
+            timeout=timeout,
+            pg_options=get_nccl_options("gtp", nccl_comm_cfgs),
+            group_desc="GTP_WEIGHT_REMAT_GROUP",
+        )
+        if rank in gtp_ranks:
+            _GTP_WEIGHT_REMAT_GROUP = group
+            _GTP_WEIGHT_REMAT_GLOBAL_RANKS = gtp_ranks
+
+    # Tokens for the FULL (gtp-inclusive) data-parallel domain. gtp is factored out of the
+    # generator's 'dp' axis, so the full data domain spans gtp explicitly ('gtp-dp'). The
+    # replicate (gtp-excluded) groups are the _*_NO_GTP variants below.
+    dp_full_token = "gtp-dp"
+    dp_cp_full_token = "gtp-dp-cp"
+
     # Set NCCL_COLLNET_ENABLE to 1 to enable SHARP for the dp group.
     if sharp_enabled_group == "dp":
         os.environ["NCCL_COLLNET_ENABLE"] = "1"
@@ -842,7 +966,7 @@ def initialize_model_parallel(
     # is eligible for using the NCCL COLLNET feature.
     # Therefore, dp-cp group, which potentially requires SHARP-enablement,
     # need to be created before all the other groups
-    for ranks_with_cp in decoder_rank_generator.get_ranks('dp-cp'):
+    for ranks_with_cp in decoder_rank_generator.get_ranks(dp_cp_full_token):
         group_with_cp = create_group(
             ranks_with_cp,
             timeout=timeout,
@@ -932,7 +1056,7 @@ def initialize_model_parallel(
             )
         # TODO: Are gloo groups needed for hybrid cp?
 
-    for ranks in decoder_rank_generator.get_ranks('dp'):
+    for ranks in decoder_rank_generator.get_ranks(dp_full_token):
         group = create_group(
             ranks,
             timeout=timeout,
@@ -950,6 +1074,46 @@ def initialize_model_parallel(
             _DATA_PARALLEL_GROUP_GLOO = group_gloo
             _DATA_PARALLEL_GLOBAL_RANKS = ranks
 
+    # Build DP groups with generalized tensor parallel accounted for.
+    # no_gtp DP = only ranks that share the same GTP-rank (true weight replicas).
+    global _DATA_PARALLEL_GROUP_NO_GTP
+    global _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP
+    global _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP
+    if gtp_remat_size > 1:
+        # The replicate (gtp-excluded) DP groups ARE get_ranks('dp') / get_ranks('dp-cp') by
+        # construction (gtp is its own axis). Every rank iterates all groups so each create_group
+        # collective is entered by all ranks.
+        for dp_ranks in decoder_rank_generator.get_ranks('dp'):
+            group = create_group(
+                dp_ranks,
+                timeout=timeout,
+                pg_options=get_nccl_options("dp_gtp", nccl_comm_cfgs),
+                group_desc="DATA_PARALLEL_GROUP_NO_GTP",
+            )
+            if rank in dp_ranks:
+                _DATA_PARALLEL_GROUP_NO_GTP = group
+
+        for dp_cp_ranks in decoder_rank_generator.get_ranks('dp-cp'):
+            group = create_group(
+                dp_cp_ranks,
+                timeout=timeout,
+                pg_options=get_nccl_options("dp_cp_gtp", nccl_comm_cfgs),
+                group_desc="DATA_PARALLEL_GROUP_WITH_CP_NO_GTP",
+            )
+            if rank in dp_cp_ranks:
+                _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = group
+
+        # GTP requires a single distributed-optimizer instance (asserted above), so the
+        # per-instance partial group is just the full replicate group.
+        _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP
+    else:
+        # GTP inactive (gtp_remat_size == 1): the replicate groups alias the regular DP groups.
+        _DATA_PARALLEL_GROUP_NO_GTP = _DATA_PARALLEL_GROUP
+        _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = _DATA_PARALLEL_GROUP_WITH_CP
+        _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = (
+            _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP
+        )
+
     # Build the context-parallel groups.
     global _CONTEXT_PARALLEL_GROUP
     global _CONTEXT_PARALLEL_GLOBAL_RANKS
@@ -979,11 +1143,12 @@ def initialize_model_parallel(
             if rank in ranks:
                 _HIERARCHICAL_CONTEXT_PARALLEL_GROUPS = hierarchical_groups
 
-    # Build the model-parallel groups.
+    # Build the model-parallel groups (TP × GTP × PP). gtp is a RankGenerator axis, so the
+    # 'tp-gtp-pp' token spans it directly; with gtp=1 it reduces to the plain tp-pp groups.
     global _MODEL_PARALLEL_GROUP
     global _MODEL_PARALLEL_GLOBAL_RANKS
     assert _MODEL_PARALLEL_GROUP is None, 'model parallel group is already initialized'
-    for ranks in decoder_rank_generator.get_ranks('tp-pp'):
+    for ranks in decoder_rank_generator.get_ranks('tp-gtp-pp'):
         group = create_group(
             ranks,
             timeout=timeout,
@@ -1167,6 +1332,26 @@ def initialize_model_parallel(
             _TENSOR_AND_CONTEXT_PARALLEL_GROUP = group
 
     ### Expert-related parallel groups initialization
+    # Build the expert generalized tensor parallel group
+    # Expert GTP overlaps with the expert DP domain (experts don't use CP).
+    global _EXPERT_GTP_WEIGHT_REMAT_GROUP
+    global _EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS
+    assert (
+        _EXPERT_GTP_WEIGHT_REMAT_GROUP is None
+    ), 'Expert generalized tensor parallel group is already initialized'
+    # EGTP shard groups are get_ranks('gtp') on the expert generator (singletons when
+    # expert_gtp_remat_size == 1). See RankGenerator.get_gtp_ranks.
+    for egtp_ranks in expert_decoder_rank_generator.get_gtp_ranks(expert_gtp_remat_size):
+        group = create_group(
+            egtp_ranks,
+            timeout=timeout,
+            pg_options=get_nccl_options("expt_gtp", nccl_comm_cfgs),
+            group_desc="EXPERT_GTP_WEIGHT_REMAT_GROUP",
+        )
+        if rank in egtp_ranks:
+            _EXPERT_GTP_WEIGHT_REMAT_GROUP = group
+            _EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS = egtp_ranks
+
     # Build the expert model parallel group
     global _EXPERT_MODEL_PARALLEL_GROUP, _EXPERT_MODEL_PARALLEL_RANKS
     assert _EXPERT_MODEL_PARALLEL_GROUP is None, 'Expert parallel group is already initialized'
@@ -1226,6 +1411,22 @@ def initialize_model_parallel(
         if rank in ranks:
             _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP = group
 
+    # Expert+tensor+pipeline group merged across EGTP peers — expert analog of the dense
+    # _MODEL_PARALLEL_GROUP merge (above). The 'tp-ep-gtp-pp' token spans the egtp axis; with
+    # expert_gtp_remat_size=1 it reduces to the plain tp-ep-pp groups. Merging gives EGTP peers
+    # distinct ranks; see docs/api-guide/core/generalized_tensor_parallel.md §3.3
+    # (Optimizer state) for the DCP-collision rationale.
+    global _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP
+    for ranks in expert_decoder_rank_generator.get_ranks('tp-ep-gtp-pp'):
+        group = create_group(
+            ranks,
+            timeout=timeout,
+            pg_options=get_nccl_options("tp_ep_pp", nccl_comm_cfgs),
+            group_desc="EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP",
+        )
+        if rank in ranks:
+            _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP = group
+
     # Build the expert data parallel group
     global _EXPERT_DATA_PARALLEL_GROUP
     assert _EXPERT_DATA_PARALLEL_GROUP is None, "Expert data group is already initialized"
@@ -1251,7 +1452,10 @@ def initialize_model_parallel(
         expert_data_parallel_size // num_distributed_optimizer_instances
     )
 
-    for ranks in expert_decoder_rank_generator.get_ranks('dp'):
+    # FULL (egtp-inclusive) expert data-parallel token (mirrors dp_full_token). Expert
+    # generator has cp=1, so the expert data domain spans gtp explicitly ('gtp-dp').
+    expert_dp_full_token = "gtp-dp"
+    for ranks in expert_decoder_rank_generator.get_ranks(expert_dp_full_token):
         group = create_group(
             ranks,
             timeout=timeout,
@@ -1307,6 +1511,26 @@ def initialize_model_parallel(
         else:
             _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP = _EXPERT_DATA_PARALLEL_GROUP
             _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_GLOO = _EXPERT_DATA_PARALLEL_GROUP_GLOO
+    # Build expert DP group with expert generalized tensor parallel accounted for.
+    global _EXPERT_DATA_PARALLEL_GROUP_NO_GTP
+    global _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP
+    if expert_gtp_remat_size > 1:
+        # The replicate (egtp-excluded) expert-DP groups ARE get_ranks('dp') (egtp is its own axis).
+        for dp_ranks in expert_decoder_rank_generator.get_ranks('dp'):
+            group = create_group(
+                dp_ranks,
+                timeout=timeout,
+                pg_options=get_nccl_options("ep_dp_gtp", nccl_comm_cfgs),
+                group_desc="EXPERT_DATA_PARALLEL_GROUP_NO_GTP",
+            )
+            if rank in dp_ranks:
+                _EXPERT_DATA_PARALLEL_GROUP_NO_GTP = group
+        _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP = _EXPERT_DATA_PARALLEL_GROUP_NO_GTP
+    else:
+        # EGTP inactive: the replicate group aliases the regular expert-DP group.
+        _EXPERT_DATA_PARALLEL_GROUP_NO_GTP = _EXPERT_DATA_PARALLEL_GROUP
+        _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP = _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP
+
     ### End of expert related parallel groups initialization
 
     # build the intra distributed optimizer instance group
@@ -1315,21 +1539,40 @@ def initialize_model_parallel(
         _INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP is None
     ), "Intra distributed optimizer instance group is already initialized"
 
-    model_parallel_group_id = 0
-    intra_dist_opt_ranks = []
-    for ranks in expert_decoder_rank_generator.get_ranks('tp-ep-pp'):
-        model_parallel_group_id += 1
-        intra_dist_opt_ranks.extend(ranks)
-        if model_parallel_group_id % intra_partial_expert_data_parallel_size == 0:
-            intra_dist_opt_instance_group = create_group(
-                intra_dist_opt_ranks,
-                timeout=timeout,
-                pg_options=get_nccl_options("intra_dist_opt_instance", nccl_comm_cfgs),
-                group_desc="INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP",
-            )
-            if rank in intra_dist_opt_ranks:
-                _INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP = intra_dist_opt_instance_group
-            intra_dist_opt_ranks = []
+    if gtp_remat_size > 1 or expert_gtp_remat_size > 1:
+        # GTP requires num_distributed_optimizer_instances == 1 (asserted above), so the dist-opt
+        # grad-stats group (used only for grad-norm + num_zeros reductions) must span the ENTIRE
+        # world. The per-instance accumulation below would NOT: gtp/egtp are factored out of
+        # expert_data_parallel_size (via expert_gtp_remat_size), so the expert-generator groups omit the
+        # gtp/egtp axes — under-counting the grad-norm for gtp/egtp-sharded params. Build one
+        # full-world group from all tp-ep-pp groups instead (get_ranks already applies rank_offset).
+        all_ranks = sorted(
+            r for ranks in expert_decoder_rank_generator.get_ranks('tp-ep-pp') for r in ranks
+        )
+        intra_dist_opt_instance_group = create_group(
+            all_ranks,
+            timeout=timeout,
+            pg_options=get_nccl_options("intra_dist_opt_instance", nccl_comm_cfgs),
+            group_desc="INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP",
+        )
+        if rank in all_ranks:
+            _INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP = intra_dist_opt_instance_group
+    else:
+        model_parallel_group_id = 0
+        intra_dist_opt_ranks = []
+        for ranks in expert_decoder_rank_generator.get_ranks('tp-ep-pp'):
+            model_parallel_group_id += 1
+            intra_dist_opt_ranks.extend(ranks)
+            if model_parallel_group_id % intra_partial_expert_data_parallel_size == 0:
+                intra_dist_opt_instance_group = create_group(
+                    intra_dist_opt_ranks,
+                    timeout=timeout,
+                    pg_options=get_nccl_options("intra_dist_opt_instance", nccl_comm_cfgs),
+                    group_desc="INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP",
+                )
+                if rank in intra_dist_opt_ranks:
+                    _INTRA_DISTRIBUTED_OPTIMIZER_INSTANCE_GROUP = intra_dist_opt_instance_group
+                intra_dist_opt_ranks = []
 
     # Initialize global memory buffer
     # This isn't really "parallel state" but there isn't another good place to
@@ -1455,6 +1698,42 @@ def get_tensor_model_parallel_group(check_initialized=True):
     return _TENSOR_MODEL_PARALLEL_GROUP
 
 
+def get_gtp_weight_remat_group(check_initialized=True):
+    """Get the parameter-sharding group the caller rank belongs to."""
+    if check_initialized:
+        assert (
+            _GTP_WEIGHT_REMAT_GROUP is not None
+        ), "generalized tensor parallel group is not initialized"
+    return _GTP_WEIGHT_REMAT_GROUP
+
+
+def get_gtp_weight_remat_world_size():
+    """Return world size for the parameter-sharding group."""
+    if torch.distributed.is_available() and torch.distributed.is_initialized():
+        group = get_gtp_weight_remat_group(check_initialized=False)
+        return group.size() if group is not None else 0
+    else:
+        return 0
+
+
+def get_gtp_weight_remat_rank():
+    """Return caller's rank in the parameter-sharding group."""
+    if torch.distributed.is_available() and torch.distributed.is_initialized():
+        group = get_gtp_weight_remat_group(check_initialized=False)
+        return group.rank() if group is not None else 0
+    else:
+        return 0
+
+
+def get_gtp_weight_remat_global_ranks(check_initialized=True):
+    """Get all global ranks of the parameter-sharding group that the caller rank belongs to."""
+    if check_initialized:
+        assert (
+            _GTP_WEIGHT_REMAT_GLOBAL_RANKS is not None
+        ), "generalized tensor parallel group is not initialized"
+    return _GTP_WEIGHT_REMAT_GLOBAL_RANKS
+
+
 def get_pipeline_model_parallel_group(check_initialized=True):
     """Get the pipeline-model-parallel group the caller rank belongs to."""
     if check_initialized:
@@ -1464,26 +1743,47 @@ def get_pipeline_model_parallel_group(check_initialized=True):
     return _PIPELINE_MODEL_PARALLEL_GROUP
 
 
-def get_data_parallel_group(with_context_parallel=False, partial_data_parallel=False):
-    """Get the data-parallel group the caller rank belongs to."""
-    if with_context_parallel:
-        if partial_data_parallel:
-            assert (
-                _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP is not None
-            ), "Intra partial data parallel group is not initialized"
-            return _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP
-        assert (
-            _DATA_PARALLEL_GROUP_WITH_CP is not None
-        ), "data parallel group with context parallel combined is not initialized"
-        return _DATA_PARALLEL_GROUP_WITH_CP
-    else:
-        assert _DATA_PARALLEL_GROUP is not None, "data parallel group is not initialized"
-        assert partial_data_parallel == False, "Partial DP for Optimizer needs to include CP"
-        return _DATA_PARALLEL_GROUP
+def get_data_parallel_group(with_context_parallel=False, no_gtp=False, partial_data_parallel=False):
+    """Get the data-parallel group the caller rank belongs to.
+
+    Args:
+        with_context_parallel: If True, include context-parallel ranks in the group.
+        no_gtp: If True, return only the true weight-replica ranks (exclude GTP peers).
+        partial_data_parallel: If True, return partial DP group (requires with_context_parallel).
+    """
+    assert (
+        with_context_parallel or not partial_data_parallel
+    ), "Partial DP for Optimizer needs to include CP"
+    # (no_gtp, with_context_parallel, partial_data_parallel) -> (group, description). The globals
+    # are read at call time (assigned during initialize_model_parallel). partial requires CP, so
+    # the (*, False, True) rows are unreachable and omitted.
+    group_table = {
+        (False, False, False): (_DATA_PARALLEL_GROUP, "data parallel group"),
+        (False, True, False): (_DATA_PARALLEL_GROUP_WITH_CP, "data parallel group with CP"),
+        (False, True, True): (
+            _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP,
+            "intra partial data parallel group with CP",
+        ),
+        (True, False, False): (_DATA_PARALLEL_GROUP_NO_GTP, "data parallel group (no GTP)"),
+        (True, True, False): (
+            _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP,
+            "data parallel group with CP (no GTP)",
+        ),
+        (True, True, True): (
+            _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP,
+            "intra partial data parallel group with CP (no GTP)",
+        ),
+    }
+    group, description = group_table[(no_gtp, with_context_parallel, partial_data_parallel)]
+    assert group is not None, f"{description} is not initialized"
+    return group
 
 
-def get_data_parallel_group_gloo(with_context_parallel=False, partial_data_parallel=False):
+def get_data_parallel_group_gloo(
+    with_context_parallel=False, no_gtp=False, partial_data_parallel=False
+):
     """Get the Gloo data-parallel group the caller rank belongs to."""
+    assert not no_gtp, "GTP does not support Gloo data-parallel groups"
     if with_context_parallel:
         if partial_data_parallel:
             assert (
@@ -1788,14 +2088,18 @@ def get_pipeline_model_parallel_prev_rank():
     return _PIPELINE_GLOBAL_RANKS[(rank_in_pipeline - 1) % world_size]
 
 
-def get_data_parallel_world_size(with_context_parallel=False, partial_data_parallel=False):
+def get_data_parallel_world_size(
+    with_context_parallel=False, no_gtp=False, partial_data_parallel=False
+):
     """Return world size for the data parallel group."""
     global _MPU_DATA_PARALLEL_WORLD_SIZE
     if _MPU_DATA_PARALLEL_WORLD_SIZE is not None:
         return _MPU_DATA_PARALLEL_WORLD_SIZE
     if torch.distributed.is_available() and torch.distributed.is_initialized():
         return get_data_parallel_group(
-            with_context_parallel=with_context_parallel, partial_data_parallel=partial_data_parallel
+            with_context_parallel=with_context_parallel,
+            no_gtp=no_gtp,
+            partial_data_parallel=partial_data_parallel,
         ).size()
     else:
         return 0
@@ -1807,14 +2111,16 @@ def set_data_parallel_rank(rank):
     _MPU_DATA_PARALLEL_RANK = rank
 
 
-def get_data_parallel_rank(with_context_parallel=False, partial_data_parallel=False):
+def get_data_parallel_rank(with_context_parallel=False, no_gtp=False, partial_data_parallel=False):
     """Return caller's rank in the data-parallel group."""
     global _MPU_DATA_PARALLEL_RANK
     if _MPU_DATA_PARALLEL_RANK is not None:
         return _MPU_DATA_PARALLEL_RANK
     if torch.distributed.is_available() and torch.distributed.is_initialized():
         return get_data_parallel_group(
-            with_context_parallel=with_context_parallel, partial_data_parallel=partial_data_parallel
+            with_context_parallel=with_context_parallel,
+            no_gtp=no_gtp,
+            partial_data_parallel=partial_data_parallel,
         ).rank()
     else:
         return 0
@@ -1853,6 +2159,42 @@ def get_tensor_and_context_parallel_rank():
 
 
 ### Expert-related parallel states functions
+def get_expert_gtp_weight_remat_group(check_initialized=True):
+    """Get the expert-parameter-sharding group the caller rank belongs to."""
+    if check_initialized:
+        assert (
+            _EXPERT_GTP_WEIGHT_REMAT_GROUP is not None
+        ), "expert generalized tensor parallel group is not initialized"
+    return _EXPERT_GTP_WEIGHT_REMAT_GROUP
+
+
+def get_expert_gtp_weight_remat_world_size():
+    """Return world size for the expert-parameter-sharding group."""
+    if torch.distributed.is_available() and torch.distributed.is_initialized():
+        group = get_expert_gtp_weight_remat_group(check_initialized=False)
+        return group.size() if group is not None else 0
+    else:
+        return 0
+
+
+def get_expert_gtp_weight_remat_rank():
+    """Return caller's rank in the expert-parameter-sharding group."""
+    if torch.distributed.is_available() and torch.distributed.is_initialized():
+        group = get_expert_gtp_weight_remat_group(check_initialized=False)
+        return group.rank() if group is not None else 0
+    else:
+        return 0
+
+
+def get_expert_gtp_weight_remat_global_ranks(check_initialized=True):
+    """Get all global ranks of the expert-parameter-sharding group that the caller rank belongs to."""
+    if check_initialized:
+        assert (
+            _EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS is not None
+        ), "expert generalized tensor parallel group is not initialized"
+    return _EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS
+
+
 def get_expert_model_parallel_group(check_initialized=True):
     """Get the expert-model-parallel group the caller rank belongs to."""
     if check_initialized:
@@ -1974,8 +2316,23 @@ def get_expert_tensor_and_model_parallel_rank():
         return 0
 
 
-def get_expert_tensor_model_pipeline_parallel_group(check_initialized=True):
-    """Get expert tensor-model-pipeline parallel group."""
+def get_expert_tensor_model_pipeline_parallel_group(check_initialized=True, with_egtp=False):
+    """Get expert tensor-model-pipeline parallel group.
+
+    Args:
+        check_initialized: If True (default), asserts the group has been created.
+        with_egtp: If True, return the EGTP-merged variant — the analog of dense
+            ``get_model_parallel_group()`` (which merges across GTP peers). Use this when you
+            need a group whose rank uniquely identifies each (ETP, EP, PP, EGTP) position;
+            e.g. for the MoE distributed optimizer's ``data_parallel_group_idx``. Identical
+            to the vanilla group when EGTP=1.
+    """
+    if with_egtp:
+        if check_initialized:
+            assert (
+                _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP is not None
+            ), "Expert tensor-model-pipeline parallel group with EGTP is not initialized"
+        return _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP
     if check_initialized:
         assert (
             _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP is not None
@@ -1983,24 +2340,32 @@ def get_expert_tensor_model_pipeline_parallel_group(check_initialized=True):
     return _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP
 
 
-def get_expert_data_parallel_group(check_initialized=True, partial_expert_data_parallel=False):
+def get_expert_data_parallel_group(
+    check_initialized=True, no_gtp=False, partial_expert_data_parallel=False
+):
     """Get expert data parallel group."""
-    if partial_expert_data_parallel:
-        if check_initialized:
-            assert (
-                _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP is not None
-            ), "Intra partial expert data parallel group is not initialized"
-        return _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP
-    else:
-        if check_initialized:
-            assert (
-                _EXPERT_DATA_PARALLEL_GROUP is not None
-            ), "Expert data parallel group is not initialized"
-        return _EXPERT_DATA_PARALLEL_GROUP
+    # (no_gtp, partial_expert_data_parallel) -> (group, description). Read at call time.
+    group_table = {
+        (False, False): (_EXPERT_DATA_PARALLEL_GROUP, "Expert data parallel group"),
+        (False, True): (
+            _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP,
+            "Intra partial expert data parallel group",
+        ),
+        (True, False): (_EXPERT_DATA_PARALLEL_GROUP_NO_GTP, "Expert data parallel group (no GTP)"),
+        (True, True): (
+            _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP,
+            "Intra partial expert data parallel group (no GTP)",
+        ),
+    }
+    group, description = group_table[(no_gtp, partial_expert_data_parallel)]
+    if check_initialized:
+        assert group is not None, f"{description} is not initialized"
+    return group
 
 
-def get_expert_data_parallel_group_gloo(partial_expert_data_parallel=False):
+def get_expert_data_parallel_group_gloo(no_gtp=False, partial_expert_data_parallel=False):
     """Get expert data parallel group-gloo."""
+    assert not no_gtp, "EGTP does not support Gloo expert-data-parallel groups"
     if partial_expert_data_parallel:
         assert (
             _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_GLOO is not None
@@ -2023,11 +2388,11 @@ def get_expert_data_parallel_rank(partial_expert_data_parallel=False):
         return 0
 
 
-def get_expert_data_parallel_world_size(partial_expert_data_parallel=False):
+def get_expert_data_parallel_world_size(no_gtp=False, partial_expert_data_parallel=False):
     """Return world size for the expert data parallel group."""
     if torch.distributed.is_available() and torch.distributed.is_initialized():
         return get_expert_data_parallel_group(
-            partial_expert_data_parallel=partial_expert_data_parallel
+            no_gtp=no_gtp, partial_expert_data_parallel=partial_expert_data_parallel
         ).size()
     else:
         return 0
@@ -2082,6 +2447,7 @@ def get_all_ranks():
     pipeline-model-parallel and expert-model-parallel groups."""
     ranks = [
         get_tensor_model_parallel_rank(),
+        get_gtp_weight_remat_rank(),
         get_data_parallel_rank(),
         get_context_parallel_rank(),
         get_pipeline_model_parallel_rank(),
@@ -2098,15 +2464,30 @@ def destroy_model_parallel():
     global _TENSOR_MODEL_PARALLEL_GROUP
     _TENSOR_MODEL_PARALLEL_GROUP = None
 
+    global _GTP_WEIGHT_REMAT_GROUP
+    _GTP_WEIGHT_REMAT_GROUP = None
+
+    global _GTP_WEIGHT_REMAT_GLOBAL_RANKS
+    _GTP_WEIGHT_REMAT_GLOBAL_RANKS = None
+
     global _PIPELINE_MODEL_PARALLEL_GROUP
     _PIPELINE_MODEL_PARALLEL_GROUP = None
 
     global _DATA_PARALLEL_GROUP
     _DATA_PARALLEL_GROUP = None
 
+    global _DATA_PARALLEL_GROUP_NO_GTP
+    _DATA_PARALLEL_GROUP_NO_GTP = None
+
     global _DATA_PARALLEL_GROUP_WITH_CP
     _DATA_PARALLEL_GROUP_WITH_CP = None
 
+    global _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP
+    _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = None
+
+    global _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP
+    _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP = None
+
     global _CONTEXT_PARALLEL_GROUP
     _CONTEXT_PARALLEL_GROUP = None
 
@@ -2173,6 +2554,12 @@ def destroy_model_parallel():
     _DATA_PARALLEL_GROUP_WITH_CP_GLOO = None
 
     # Destroy parallel state related to expert parallelism.
+    global _EXPERT_GTP_WEIGHT_REMAT_GROUP
+    _EXPERT_GTP_WEIGHT_REMAT_GROUP = None
+
+    global _EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS
+    _EXPERT_GTP_WEIGHT_REMAT_GLOBAL_RANKS = None
+
     global _EXPERT_MODEL_PARALLEL_GROUP
     _EXPERT_MODEL_PARALLEL_GROUP = None
 
@@ -2197,9 +2584,18 @@ def destroy_model_parallel():
     global _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP
     _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP = None
 
+    global _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP
+    _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP = None
+
     global _EXPERT_DATA_PARALLEL_GROUP
     _EXPERT_DATA_PARALLEL_GROUP = None
 
+    global _EXPERT_DATA_PARALLEL_GROUP_NO_GTP
+    _EXPERT_DATA_PARALLEL_GROUP_NO_GTP = None
+
+    global _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP
+    _INTRA_PARTIAL_EXPERT_DATA_PARALLEL_GROUP_NO_GTP = None
+
     global _EXPERT_DATA_PARALLEL_GROUP_GLOO
     if (
         _EXPERT_DATA_PARALLEL_GROUP_GLOO is not None
diff --git a/megatron/core/process_groups_config.py b/megatron/core/process_groups_config.py
index 6c1e3651387..647aa9b4a3e 100644
--- a/megatron/core/process_groups_config.py
+++ b/megatron/core/process_groups_config.py
@@ -44,13 +44,23 @@ class ProcessGroupCollection:
         expt_tp: Expert tensor parallel group
         tp_ep: Tensor and expert parallel group
         tp_ep_pp: Tensor, expert, and pipeline parallel group
+        tp_ep_pp_with_egtp: tp_ep_pp merged across EGTP peers (analog of dense ``mp`` under GTP);
+            identical to ``tp_ep_pp`` when EGTP=1
 
         # Data Parallelism Groups
         dp: Data parallel process group
         dp_cp: Data and context parallel group
+        dp_cp_no_gtp: Data and context parallel group excluding GTP peers
+            (true dense-weight replicas); identical to dp_cp when GTP=1
         expt_dp: Expert data parallel group
+        expt_dp_no_egtp: Expert data parallel group excluding EGTP peers
+            (true expert-weight replicas); identical to expt_dp when EGTP=1
         intra_dp_cp: Intra partial data parallel group
+        intra_dp_cp_no_gtp: Intra partial data parallel group excluding GTP peers
+            (true dense-weight replicas); identical to intra_dp_cp when GTP=1
         intra_expt_dp: Intra partial expert data parallel group
+        intra_expt_dp_no_egtp: Intra expert data parallel group excluding EGTP peers
+            (true expert-weight replicas); identical to intra_expt_dp when EGTP=1
         inter_dist_opt: Inter distributed optimizer instance group
 
     Example:
@@ -104,6 +114,11 @@ class ProcessGroupCollection:
     # _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP
     tp_ep_pp: torch.distributed.ProcessGroup = field(init=False)
 
+    # _EXPERT_TENSOR_MODEL_PIPELINE_PARALLEL_GROUP_WITH_EGTP — expert "model parallel" group
+    # merged across EGTP peers (analog of dense ``mp`` under GTP). Identical to ``tp_ep_pp``
+    # when EGTP=1.
+    tp_ep_pp_with_egtp: torch.distributed.ProcessGroup = field(init=False)
+
     # _TENSOR_AND_DATA_PARALLEL_GROUP_WITH_CP
     tp_dp_cp: torch.distributed.ProcessGroup = field(init=False)
 
@@ -114,9 +129,19 @@ class ProcessGroupCollection:
     # _DATA_PARALLEL_GROUP_WITH_CP
     dp_cp: torch.distributed.ProcessGroup = field(init=False)
 
+    # _DATA_PARALLEL_GROUP_WITH_CP_NO_GTP — DP+CP excluding GTP peers (true dense-weight
+    # replicas). Identical to ``dp_cp`` when GTP=1.
+    dp_cp_no_gtp: torch.distributed.ProcessGroup = field(init=False)
+
     # Separate dp_cp communicator for param all-gather (AG/RS overlap)
     dp_cp_ag: torch.distributed.ProcessGroup = field(init=False)
 
+    # _GTP_WEIGHT_REMAT_GROUP
+    gtp: torch.distributed.ProcessGroup = field(init=False)
+
+    # _EXPERT_GTP_WEIGHT_REMAT_GROUP
+    expt_gtp: torch.distributed.ProcessGroup = field(init=False)
+
     # MoE layers need expt_dp group for sharded state dict
     # we need this workaround until distributed checkpoint is refactored
     # to have sharded_state_dict can take the PG and pass it down
@@ -124,15 +149,27 @@ class ProcessGroupCollection:
     # _EXPERT_DATA_PARALLEL_GROUP
     expt_dp: torch.distributed.ProcessGroup = field(init=False)
 
+    # _EXPERT_DATA_PARALLEL_GROUP_NO_EGTP — expert DP excluding EGTP peers (true expert-weight
+    # replicas). Identical to ``expt_dp`` when EGTP=1.
+    expt_dp_no_egtp: torch.distributed.ProcessGroup = field(init=False)
+
     # _EXPERT_DATA_PARALLEL_GROUP_AG
     expt_dp_ag: torch.distributed.ProcessGroup = field(init=False)
 
     # _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP
     intra_dp_cp: torch.distributed.ProcessGroup = field(init=False)
 
+    # _INTRA_PARTIAL_DATA_PARALLEL_GROUP_WITH_CP_NO_GTP — intra-instance DP+CP excluding GTP
+    # peers (true dense-weight replicas). Identical to ``intra_dp_cp`` when GTP=1.
+    intra_dp_cp_no_gtp: torch.distributed.ProcessGroup = field(init=False)
+
     # _INTRA_EXPERT_DATA_PARALLEL_GROUP
     intra_expt_dp: torch.distributed.ProcessGroup = field(init=False)
 
+    # _INTRA_EXPERT_DATA_PARALLEL_GROUP_NO_EGTP — intra-instance expert DP excluding EGTP
+    # peers (true expert-weight replicas). Identical to ``intra_expt_dp`` when EGTP=1.
+    intra_expt_dp_no_egtp: torch.distributed.ProcessGroup = field(init=False)
+
     # _INTER_PARTIAL_EXPERT_DATA_PARALLEL_GROUP
     inter_dist_opt: torch.distributed.ProcessGroup = field(init=False)
 
@@ -146,19 +183,27 @@ def __init__(self, **kwargs):
             else:
                 raise ValueError(f"Unknown attribute: {key}")
 
+    def __getattr__(self, name: str):
+        # Return None for any declared field that was not set during partial construction
+        # (e.g. when use_mpu_process_groups is called with a subset of required_pgs).
+        if name in {f.name for f in fields(self.__class__)}:
+            return None
+        raise AttributeError(f"'ProcessGroupCollection' object has no attribute '{name}'")
+
     def __repr__(self):
         """Return a concise representation showing which process groups exist and their sizes."""
         active_pgs = []
         for field_info in fields(self):
-            if hasattr(self, field_info.name):
-                pg = getattr(self, field_info.name)
-                if pg is None:
-                    active_pgs.append(f"{field_info.name}(None)")
-                elif isinstance(pg, list):
-                    sizes = [g.size() for g in pg]
-                    active_pgs.append(f"{field_info.name}({sizes})")
-                else:
-                    active_pgs.append(f"{field_info.name}({pg.size()})")
+            if field_info.name not in vars(self):
+                continue
+            pg = getattr(self, field_info.name)
+            if pg is None:
+                continue
+            elif isinstance(pg, list):
+                sizes = [g.size() for g in pg]
+                active_pgs.append(f"{field_info.name}({sizes})")
+            else:
+                active_pgs.append(f"{field_info.name}({pg.size()})")
         return (
             f"ProcessGroupCollection({', '.join(active_pgs)})"
             if active_pgs
@@ -212,23 +257,43 @@ def use_mpu_process_groups(cls, required_pgs: Optional[List[str]] = None):
                 parallel_state.get_expert_tensor_model_pipeline_parallel_group,
                 check_initialized=False,
             ),
+            'tp_ep_pp_with_egtp': partial(
+                parallel_state.get_expert_tensor_model_pipeline_parallel_group,
+                check_initialized=False,
+                with_egtp=True,
+            ),
             'embd': partial(parallel_state.get_embedding_group, check_initialized=False),
             'pos_embd': partial(
                 parallel_state.get_position_embedding_group, check_initialized=False
             ),
             'dp': parallel_state.get_data_parallel_group,
             'dp_cp': partial(parallel_state.get_data_parallel_group, with_context_parallel=True),
+            'dp_cp_no_gtp': partial(
+                parallel_state.get_data_parallel_group, with_context_parallel=True, no_gtp=True
+            ),
             'dp_cp_ag': lambda: None,
             'intra_dp_cp': partial(
                 parallel_state.get_data_parallel_group,
                 with_context_parallel=True,
                 partial_data_parallel=True,
             ),
+            'intra_dp_cp_no_gtp': partial(
+                parallel_state.get_data_parallel_group,
+                with_context_parallel=True,
+                no_gtp=True,
+                partial_data_parallel=True,
+            ),
             'intra_expt_dp': partial(
                 parallel_state.get_expert_data_parallel_group,
                 check_initialized=False,
                 partial_expert_data_parallel=True,
             ),
+            'intra_expt_dp_no_egtp': partial(
+                parallel_state.get_expert_data_parallel_group,
+                check_initialized=False,
+                no_gtp=True,
+                partial_expert_data_parallel=True,
+            ),
             'inter_dist_opt': partial(
                 parallel_state.get_inter_distributed_optimizer_instance_group,
                 check_initialized=False,
@@ -241,12 +306,19 @@ def use_mpu_process_groups(cls, required_pgs: Optional[List[str]] = None):
             'expt_dp': partial(
                 parallel_state.get_expert_data_parallel_group, check_initialized=False
             ),
+            'expt_dp_no_egtp': partial(
+                parallel_state.get_expert_data_parallel_group, check_initialized=False, no_gtp=True
+            ),
             'expt_dp_ag': lambda: None,
             'tp_dp_cp': partial(
                 parallel_state.get_tensor_and_data_parallel_group,
                 check_initialized=False,
                 with_context_parallel=True,
             ),
+            'gtp': partial(parallel_state.get_gtp_weight_remat_group, check_initialized=False),
+            'expt_gtp': partial(
+                parallel_state.get_expert_gtp_weight_remat_group, check_initialized=False
+            ),
         }
 
         assert all(
@@ -259,6 +331,19 @@ def use_mpu_process_groups(cls, required_pgs: Optional[List[str]] = None):
 
         return cls(**init_dict)
 
+    @staticmethod
+    def is_gtp_active(process_group_dict: Dict) -> bool:
+        """True iff GTP or EGTP is active (a weight-shard group spans >1 rank).
+
+        Reads the 'gtp_group'/'expt_gtp_group' entries produced by both setup_process_groups_for_*
+        builders; a None group means that axis is unused.
+        """
+        gtp = process_group_dict.get('gtp_group')
+        expt_gtp = process_group_dict.get('expt_gtp_group')
+        return (gtp is not None and gtp.size() > 1) or (
+            expt_gtp is not None and expt_gtp.size() > 1
+        )
+
     @staticmethod
     def setup_process_groups_for_optimizer(
         pg_collection: Optional['ProcessGroupCollection'],
@@ -279,8 +364,14 @@ def setup_process_groups_for_optimizer(
                 - dp_group: Data parallel group
                 - dp_cp_group: Data parallel with context parallel group
                 - intra_dp_cp_group: Intra data parallel with context parallel group
+                - intra_dp_cp_no_gtp_group: Intra data parallel with context parallel and
+                    generalized tensor parallel group (excludes GTP peers, i.e. only true dense
+                    weight replicas)
                 - expt_dp_group: Expert data parallel group
                 - intra_expt_dp_group: Intra expert data parallel group
+                - intra_expt_dp_no_egtp_group: Intra expert data parallel group excluding
+                    EGTP peers (true expert-weight replicas); identical to expt_dp_group when
+                    EGTP=1
                 - mp_group: Model parallel group
                 - expt_tp_pp_group: Expert tensor-model-pipeline parallel group
                 - inter_dist_opt_group: Inter distributed optimizer group (may be None)
@@ -293,6 +384,9 @@ def setup_process_groups_for_optimizer(
 
         if pg_collection is None:
             # Use parallel_state groups
+            # Dense (non-GTP) params use no_gtp=False (full DP group) to maximize
+            # optimizer state sharding. GTP params use no_gtp=True (smaller group)
+            # since GTP's reduce-scatter already handled the GTP dimension.
             dp_group = parallel_state.get_data_parallel_group(
                 with_context_parallel=False, partial_data_parallel=False
             )
@@ -302,10 +396,24 @@ def setup_process_groups_for_optimizer(
             intra_dp_cp_group = parallel_state.get_data_parallel_group(
                 with_context_parallel=True, partial_data_parallel=True
             )
+            intra_dp_cp_no_gtp_group = parallel_state.get_data_parallel_group(
+                with_context_parallel=True, no_gtp=True, partial_data_parallel=True
+            )
+            dp_cp_no_gtp_group = parallel_state.get_data_parallel_group(
+                with_context_parallel=True, no_gtp=True
+            )
             expt_dp_group = parallel_state.get_expert_data_parallel_group()
             intra_expt_dp_group = parallel_state.get_expert_data_parallel_group(
                 partial_expert_data_parallel=True
             )
+            intra_expt_dp_no_egtp_group = parallel_state.get_expert_data_parallel_group(
+                no_gtp=True, partial_expert_data_parallel=True
+            )
+            expt_dp_no_egtp_group = parallel_state.get_expert_data_parallel_group(no_gtp=True)
+            gtp_group = parallel_state.get_gtp_weight_remat_group(check_initialized=False)
+            expt_gtp_group = parallel_state.get_expert_gtp_weight_remat_group(
+                check_initialized=False
+            )
             intra_dist_opt_group = parallel_state.get_intra_distributed_optimizer_instance_group()
 
             # Gloo groups
@@ -323,6 +431,9 @@ def setup_process_groups_for_optimizer(
             # Model communication groups
             mp_group = parallel_state.get_model_parallel_group()
             expt_tp_pp_group = parallel_state.get_expert_tensor_model_pipeline_parallel_group()
+            expt_tp_pp_with_egtp_group = (
+                parallel_state.get_expert_tensor_model_pipeline_parallel_group(with_egtp=True)
+            )
 
             # Inter distributed optimizer group
             if hasattr(model_chunks[0], 'ddp_config'):
@@ -338,14 +449,15 @@ def setup_process_groups_for_optimizer(
 
         else:
             # Use provided process group collection with validation and fallbacks
+            pg_set = vars(pg_collection)
 
             # 1. dp group - this is always required
-            if not hasattr(pg_collection, 'dp'):
+            if 'dp' not in pg_set:
                 raise ValueError("dp process group is required but not provided in pg_collection")
             dp_group = pg_collection.dp
 
             # 2. dp_cp group: fallback logic based on context_parallel_size
-            if hasattr(pg_collection, 'dp_cp'):
+            if 'dp_cp' in pg_set:
                 dp_cp_group = pg_collection.dp_cp
             else:
                 model_config = get_model_config(model_chunks[0])
@@ -360,7 +472,7 @@ def setup_process_groups_for_optimizer(
                     )
 
             # 3. Handle expert data parallel group
-            if not hasattr(pg_collection, 'expt_dp'):
+            if 'expt_dp' not in pg_set:
                 raise ValueError(
                     "expt_dp process group is required but not provided in pg_collection. "
                     "Please explicitly set it to None if you don't need it."
@@ -381,10 +493,10 @@ def setup_process_groups_for_optimizer(
                 else:
                     # With multiple optimizer instances, both groups must be provided
                     if not (
-                        hasattr(pg_collection, 'intra_dp_cp')
-                        and hasattr(pg_collection, 'intra_expt_dp')
-                        and hasattr(pg_collection, 'inter_dist_opt')
-                        and hasattr(pg_collection, 'intra_dist_opt')
+                        'intra_dp_cp' in pg_set
+                        and 'intra_expt_dp' in pg_set
+                        and 'inter_dist_opt' in pg_set
+                        and 'intra_dist_opt' in pg_set
                     ):
                         raise ValueError(
                             "intra_dp_cp, intra_expt_dp, inter_dist_opt, and intra_dist_opt "
@@ -396,7 +508,7 @@ def setup_process_groups_for_optimizer(
                     inter_dist_opt_group = pg_collection.inter_dist_opt
 
                 if ddp_config.use_distributed_optimizer:
-                    if not hasattr(pg_collection, 'intra_dist_opt'):
+                    if 'intra_dist_opt' not in pg_set:
                         raise ValueError(
                             "intra_dist_opt process group is required but not provided in "
                             "pg_collection. Please explicitly set it to None if you don't need it."
@@ -412,7 +524,7 @@ def setup_process_groups_for_optimizer(
                 intra_dist_opt_group = None
 
             # 5. Model communication groups
-            if not hasattr(pg_collection, 'mp'):
+            if 'mp' not in pg_set:
                 raise ValueError(
                     "mp process group is required but not provided in pg_collection. "
                     "Please explicitly set it to None if you don't need it."
@@ -420,13 +532,49 @@ def setup_process_groups_for_optimizer(
             mp_group = pg_collection.mp
 
             # Expert tensor-model-pipeline group for MoE
-            if not hasattr(pg_collection, 'tp_ep_pp'):
+            if 'tp_ep_pp' not in pg_set:
                 raise ValueError(
                     "tp_ep_pp process group is required but not provided in pg_collection. "
                     "Please explicitly set it to None if you don't need it."
                 )
             expt_tp_pp_group = pg_collection.tp_ep_pp
 
+            # EGTP-MERGED variant of tp_ep_pp: includes the egtp axis, so each EGTP peer gets a
+            # distinct rank — used for the distopt ShardedObject keys. Falls back to tp_ep_pp
+            # when not provided.
+            if 'tp_ep_pp_with_egtp' in pg_set:
+                expt_tp_pp_with_egtp_group = pg_collection.tp_ep_pp_with_egtp
+            else:
+                expt_tp_pp_with_egtp_group = expt_tp_pp_group
+
+            # 6. no_gtp groups — the gtp-EXCLUDED replicate groups that DDP and the optimizer
+            #    shard over: intra (per-distopt-instance) and full (cross-instance). Fall back to
+            #    the non-GTP variants when not provided.
+            if 'intra_dp_cp_no_gtp' in pg_set:
+                intra_dp_cp_no_gtp_group = pg_collection.intra_dp_cp_no_gtp
+            else:
+                intra_dp_cp_no_gtp_group = intra_dp_cp_group
+            if 'dp_cp_no_gtp' in pg_set:
+                dp_cp_no_gtp_group = pg_collection.dp_cp_no_gtp
+            else:
+                dp_cp_no_gtp_group = dp_cp_group
+
+            # 7. no_egtp groups — the expert analog of §6: the egtp-EXCLUDED replicate groups,
+            #    intra (per-distopt-instance) and full (cross-instance). Fall back to the
+            #    non-EGTP variants when not provided.
+            if 'intra_expt_dp_no_egtp' in pg_set:
+                intra_expt_dp_no_egtp_group = pg_collection.intra_expt_dp_no_egtp
+            else:
+                intra_expt_dp_no_egtp_group = intra_expt_dp_group
+            if 'expt_dp_no_egtp' in pg_set:
+                expt_dp_no_egtp_group = pg_collection.expt_dp_no_egtp
+            else:
+                expt_dp_no_egtp_group = expt_dp_group
+
+            # 8. GTP weight-shard groups (None when inactive); used to detect whether GTP is on.
+            gtp_group = getattr(pg_collection, 'gtp', None)
+            expt_gtp_group = getattr(pg_collection, 'expt_gtp', None)
+
             # Gloo groups - not supported when pg_collection is provided
             if use_gloo_process_groups:
                 raise ValueError(
@@ -439,11 +587,18 @@ def setup_process_groups_for_optimizer(
         return {
             'dp_group': dp_group,
             'dp_cp_group': dp_cp_group,
+            'dp_cp_no_gtp_group': dp_cp_no_gtp_group,
             'intra_dp_cp_group': intra_dp_cp_group,
+            'intra_dp_cp_no_gtp_group': intra_dp_cp_no_gtp_group,
             'expt_dp_group': expt_dp_group,
+            'expt_dp_no_egtp_group': expt_dp_no_egtp_group,
             'intra_expt_dp_group': intra_expt_dp_group,
+            'intra_expt_dp_no_egtp_group': intra_expt_dp_no_egtp_group,
+            'gtp_group': gtp_group,
+            'expt_gtp_group': expt_gtp_group,
             'mp_group': mp_group,
             'expt_tp_pp_group': expt_tp_pp_group,
+            'expt_tp_pp_with_egtp_group': expt_tp_pp_with_egtp_group,
             'inter_dist_opt_group': inter_dist_opt_group,
             'intra_dist_opt_group': intra_dist_opt_group,
             'intra_dp_cp_group_gloo': intra_dp_cp_group_gloo,
@@ -487,12 +642,20 @@ def setup_process_groups_for_ddp(
                     with_context_parallel=True, partial_data_parallel=True
                 ),
                 'expt_dp_group': parallel_state.get_expert_data_parallel_group(),
+                'expt_dp_no_egtp_group': parallel_state.get_expert_data_parallel_group(no_gtp=True),
                 'intra_expt_dp_group': parallel_state.get_expert_data_parallel_group(
                     partial_expert_data_parallel=True
                 ),
+                'intra_expt_dp_no_egtp_group': parallel_state.get_expert_data_parallel_group(
+                    no_gtp=True, partial_expert_data_parallel=True
+                ),
                 'tp_group': parallel_state.get_tensor_model_parallel_group(),
+                'gtp_group': parallel_state.get_gtp_weight_remat_group(check_initialized=False),
                 'pp_group': parallel_state.get_pipeline_model_parallel_group(),
                 'ep_group': parallel_state.get_expert_model_parallel_group(),
+                'expt_gtp_group': parallel_state.get_expert_gtp_weight_remat_group(
+                    check_initialized=False
+                ),
                 'inter_dist_opt_group': (
                     parallel_state.get_inter_distributed_optimizer_instance_group()
                     if ddp_config.num_distributed_optimizer_instances > 1
@@ -503,18 +666,25 @@ def setup_process_groups_for_ddp(
                     if ddp_config.use_distributed_optimizer
                     else None
                 ),
+                'intra_dp_cp_no_gtp_group': parallel_state.get_data_parallel_group(
+                    with_context_parallel=True, no_gtp=True, partial_data_parallel=True
+                ),
+                'dp_cp_no_gtp_group': parallel_state.get_data_parallel_group(
+                    with_context_parallel=True, no_gtp=True
+                ),
             }
         else:
             # Use provided process group collection with validation and fallbacks
             result = {}
+            pg_set = vars(pg_collection)
 
             # 1. dp group - this is always required
-            if not hasattr(pg_collection, 'dp'):
+            if 'dp' not in pg_set:
                 raise ValueError("dp process group is required but not provided in pg_collection")
             result['dp_group'] = pg_collection.dp
 
             # 2. dp_cp group: fallback logic based on context_parallel_size
-            if hasattr(pg_collection, 'dp_cp'):
+            if 'dp_cp' in pg_set:
                 result['dp_cp_group'] = pg_collection.dp_cp
             else:
                 cp_size = getattr(config, 'context_parallel_size', 1)
@@ -550,9 +720,9 @@ def setup_process_groups_for_ddp(
             else:
                 # With multiple optimizer instances, groups must be provided
                 if not (
-                    hasattr(pg_collection, 'intra_dp_cp')
-                    and hasattr(pg_collection, 'intra_expt_dp')
-                    and hasattr(pg_collection, 'inter_dist_opt')
+                    'intra_dp_cp' in pg_set
+                    and 'intra_expt_dp' in pg_set
+                    and 'inter_dist_opt' in pg_set
                 ):
                     raise ValueError(
                         "intra_dp_cp, intra_expt_dp, and inter_dist_opt "
@@ -564,13 +734,7 @@ def setup_process_groups_for_ddp(
                 result['inter_dist_opt_group'] = pg_collection.inter_dist_opt
 
             # 5. Model parallel groups (DDP-specific: tp, pp, ep instead of mp, expt_tp_pp)
-            if not all(
-                [
-                    hasattr(pg_collection, 'tp'),
-                    hasattr(pg_collection, 'pp'),
-                    hasattr(pg_collection, 'ep'),
-                ]
-            ):
+            if not all(['tp' in pg_set, 'pp' in pg_set, 'ep' in pg_set]):
                 raise ValueError(
                     "tp, pp and ep process groups are required but not provided in pg_collection"
                 )
@@ -578,6 +742,34 @@ def setup_process_groups_for_ddp(
             result['pp_group'] = pg_collection.pp
             result['ep_group'] = pg_collection.ep
 
+            # 6. GTP partial group (fallback to intra_dp_cp if not provided)
+            if 'intra_dp_cp_no_gtp' in pg_set:
+                result['intra_dp_cp_no_gtp_group'] = pg_collection.intra_dp_cp_no_gtp
+            else:
+                result['intra_dp_cp_no_gtp_group'] = result['intra_dp_cp_group']
+
+            # 7. EGTP partial group (fallback to intra_expt_dp if not provided)
+            if 'intra_expt_dp_no_egtp' in pg_set:
+                result['intra_expt_dp_no_egtp_group'] = pg_collection.intra_expt_dp_no_egtp
+            else:
+                result['intra_expt_dp_no_egtp_group'] = result['intra_expt_dp_group']
+
+            # 8. Full (cross-instance) with-GTP-excluded variants for callers that need to
+            # reach ALL true weight replicas (e.g., broadcast_params at init). Fall back
+            # to the corresponding non-GTP-excluded full group when not provided.
+            if 'dp_cp_no_gtp' in pg_set:
+                result['dp_cp_no_gtp_group'] = pg_collection.dp_cp_no_gtp
+            else:
+                result['dp_cp_no_gtp_group'] = result['dp_cp_group']
+            if 'expt_dp_no_egtp' in pg_set:
+                result['expt_dp_no_egtp_group'] = pg_collection.expt_dp_no_egtp
+            else:
+                result['expt_dp_no_egtp_group'] = result['expt_dp_group']
+
+            # 9. GTP weight-shard groups (None when inactive); used to detect whether GTP is on.
+            result['gtp_group'] = getattr(pg_collection, 'gtp', None)
+            result['expt_gtp_group'] = getattr(pg_collection, 'expt_gtp', None)
+
             return result
 
 
diff --git a/megatron/core/ssm/mamba_mixer.py b/megatron/core/ssm/mamba_mixer.py
index 060234fcadd..40dd5d348c1 100644
--- a/megatron/core/ssm/mamba_mixer.py
+++ b/megatron/core/ssm/mamba_mixer.py
@@ -30,6 +30,7 @@
 from megatron.core.ssm.ops.causal_conv1d_triton import causal_conv1d_update
 from megatron.core.ssm.ops.mamba_ssm import selective_state_update
 from megatron.core.tensor_parallel import get_cuda_rng_tracker
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
 from megatron.core.transformer import TransformerConfig
 from megatron.core.transformer.module import MegatronModule
 from megatron.core.transformer.spec_utils import ModuleSpec, build_module
@@ -45,8 +46,14 @@
     is_mamba_min_version,
     is_using_quantization_scales,
     log_single_rank,
+    make_tp_sharded_tensor_for_checkpoint,
 )
 
+if HAVE_GTP:
+    from megatron.core.tensor_parallel.gtp import GTPShardedParam
+else:
+    GTPShardedParam = None
+
 from .mamba_context_parallel import MambaContextParallel
 
 try:
@@ -280,6 +287,7 @@ def __init__(
             tp_comm_buffer_name="fc1",
             tp_group=self.pg_collection.tp,
             name=(name + f".in_proj") if name is not None else None,
+            gtp_group=self.pg_collection.gtp,
         )
         # in_proj packs [z, x, B, C, dt] into one ColumnParallelLinear.  Each
         # component is independently TP-sharded but with different sizes.  When
@@ -431,6 +439,7 @@ def __init__(
             tp_comm_buffer_name="fc2",
             tp_group=self.pg_collection.tp,
             name=(name + f".out_proj") if name is not None else None,
+            gtp_group=self.pg_collection.gtp,
         )
 
         # Regarding `conv1d`.{`weight`, `bias`}, `dt_bias`, `A_log`, and `D`: these are the
@@ -1359,6 +1368,35 @@ def sharded_state_dict(self, prefix="", sharded_offsets=(), metadata=None):
             + 2 * self.ngroups_local_tp * self.d_state
             + self.nheads_local_tp
         )
+        # Under GTP, in_proj.weight is GTP-sliced along axis 0. The [z|x|B|C|dt] split boundaries
+        # don't line up with GTP slice boundaries, so gather the shards back to TP-local size
+        # (strip the trailing pad rows from the gathered tail) and fall through to the same
+        # split path the non-GTP run uses — saved ckpt format matches a non-GTP run.
+        in_proj_gtp_size = getattr(self.in_proj, "gtp_size", 1)
+        if in_proj_gtp_size > 1 and HAVE_GTP and isinstance(self.in_proj.weight, GTPShardedParam):
+            gtp_shard = self.in_proj.weight
+            gtp_group = gtp_shard.group
+            local = gtp_shard.data.contiguous()
+            gathered = torch.empty(
+                (local.shape[0] * in_proj_gtp_size,) + local.shape[1:],
+                dtype=local.dtype,
+                device=local.device,
+            )
+            torch.distributed.all_gather_into_tensor(gathered, local, group=gtp_group)
+            if gathered.shape[0] != in_proj_dim:
+                gathered = gathered[:in_proj_dim].contiguous()
+            # Gathered weight is replicated across full dp_cp; replica_id needs only the DP slot.
+            dp_cp_rank = torch.distributed.get_rank(metadata['dp_cp_group'])
+            sharded_state_dict[f"{prefix}in_proj.weight"] = make_tp_sharded_tensor_for_checkpoint(
+                gathered,
+                f"{prefix}in_proj.weight",
+                tp_axis=0,
+                replica_id=(0, 0, dp_cp_rank),
+                prepend_offsets=sharded_offsets,
+                tp_group=self.tp_group,
+                dp_cp_group=metadata['dp_cp_group'],
+            )
+
         assert sharded_state_dict[f"{prefix}in_proj.weight"].data.size(0) == in_proj_dim, (
             in_proj_dim,
             sharded_state_dict[f"{prefix}in_proj.weight"],
@@ -1377,6 +1415,39 @@ def sharded_state_dict(self, prefix="", sharded_offsets=(), metadata=None):
             0,
         )
 
+        # GTP load-side inverse of the save-time all-gather (see
+        # docs/api-guide/core/generalized_tensor_parallel.md §3.3, in_proj
+        # note): the checkpoint stores the FULL TP-local in_proj.weight (pad stripped) under the
+        # 5 split keys [z|x|B|C|dt], so the default merge_fn cats them back to ``in_proj_dim``
+        # rows with no padding. To reload into the live GTPShardedParam we must mirror init
+        # (``_gtp_slice_one_param``): F.pad the merged tensor with zeros up to
+        # ``gtp_local_size * gtp_size``, then slice by ``gtp_rank``. GTP=1 has no pad/slice.
+        if in_proj_gtp_size > 1 and HAVE_GTP and isinstance(self.in_proj.weight, GTPShardedParam):
+            factory = sharded_state_dict[f"{prefix}in_proj.weight"]
+            gtp_local_rank = torch.distributed.get_rank(self.in_proj.weight.group)
+            gtp_local_size = self.in_proj.weight.data.size(0)
+            original_merge_fn = factory.merge_fn
+
+            @torch.no_grad()
+            def _gtp_slice_after_cat(
+                sub_state_dict,
+                _orig=original_merge_fn,
+                _rank=gtp_local_rank,
+                _size=gtp_local_size,
+                _gtp_size=in_proj_gtp_size,
+            ):
+                full = _orig(sub_state_dict)
+                aligned_total = _size * _gtp_size
+                pad_rows = aligned_total - full.shape[0]
+                if pad_rows > 0:
+                    full = torch.nn.functional.pad(full, (0, 0, 0, pad_rows))
+                start = _rank * _size
+                return full[start : start + _size].contiguous()
+
+            sharded_state_dict[f"{prefix}in_proj.weight"] = replace(
+                factory, merge_fn=_gtp_slice_after_cat
+            )
+
         conv_dim = self.d_inner_local_tp + 2 * self.ngroups_local_tp * self.d_state
         assert sharded_state_dict[f"{prefix}conv1d_weight"].data.size(0) == conv_dim, (
             conv_dim,
diff --git a/megatron/core/tensor_parallel/__init__.py b/megatron/core/tensor_parallel/__init__.py
index 0852014a859..6147ae7b65d 100644
--- a/megatron/core/tensor_parallel/__init__.py
+++ b/megatron/core/tensor_parallel/__init__.py
@@ -10,8 +10,10 @@
     ColumnParallelLinear,
     RowParallelLinear,
     VocabParallelEmbedding,
+    copy_gtp_attributes,
     copy_tensor_model_parallel_attributes,
     linear_with_grad_accumulation_and_async_allreduce,
+    param_is_not_gtp_duplicate,
     param_is_not_tensor_parallel_duplicate,
     set_defaults_if_not_set_tensor_model_parallel_attributes,
     set_tensor_model_parallel_attributes,
@@ -58,7 +60,9 @@
     "set_tensor_model_parallel_attributes",
     "set_defaults_if_not_set_tensor_model_parallel_attributes",
     "copy_tensor_model_parallel_attributes",
+    "copy_gtp_attributes",
     "param_is_not_tensor_parallel_duplicate",
+    "param_is_not_gtp_duplicate",
     "linear_with_grad_accumulation_and_async_allreduce",
     # mappings.py
     "copy_to_tensor_model_parallel_region",
diff --git a/megatron/core/tensor_parallel/generalized_tensor_parallelism.py b/megatron/core/tensor_parallel/generalized_tensor_parallelism.py
new file mode 100644
index 00000000000..abdb60a882c
--- /dev/null
+++ b/megatron/core/tensor_parallel/generalized_tensor_parallelism.py
@@ -0,0 +1,1930 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Generalized Tensor Parallelism (GTP).
+
+Shards weight tensors 1/N across a GTP process group along ``out_features``
+and materializes them on-demand via async all-gather, with a per-weight
+prefetch chain + ticket-based buffer cache co-designed for CUDA graph
+capture/replay.  Quantized AG (FP8 / MXFP8 / NVFP4) composes with the
+sharding for compounding bandwidth reduction.
+
+See ``docs/api-guide/core/generalized_tensor_parallel.md`` for design and usage.
+"""
+
+from __future__ import annotations
+
+import logging
+import math
+import os
+import re
+import warnings
+from collections import defaultdict
+from contextlib import contextmanager, nullcontext
+from dataclasses import dataclass, field
+from enum import Enum
+from typing import Dict, List, Optional
+
+import torch
+from packaging.version import Version
+
+from megatron.core.utils import log_single_rank
+
+logger = logging.getLogger(__name__)
+
+_GTP_TE_MIN_VERSION = Version("2.17")
+
+try:
+    import transformer_engine as te  # noqa: F401
+
+    _te_version = Version(te.__version__)
+    if _te_version < _GTP_TE_MIN_VERSION and not os.environ.get("MEGATRON_GTP_FORCE_ENABLE"):
+        raise ImportError(
+            f"megatron.core.tensor_parallel.gtp requires TransformerEngine "
+            f">= {_GTP_TE_MIN_VERSION} (found {_te_version}). Set MEGATRON_GTP_FORCE_ENABLE=1 "
+            "to bypass this check when using a custom TE build with the GTP hook registry."
+        )
+
+    import transformer_engine_torch as tex
+    from transformer_engine.pytorch.constants import (
+        MXFP8_BLOCK_SCALING_SIZE,
+        NVFP4_BLOCK_SCALING_SIZE,
+    )
+    from transformer_engine.pytorch.distributed import (
+        _NVFP4AllGatherAsyncHandle,
+        gather_along_first_dim,
+        in_fp8_activation_recompute_phase,
+        reduce_scatter_along_first_dim,
+    )
+    from transformer_engine.pytorch.module.base import get_dummy_wgrad
+    from transformer_engine.pytorch.quantized_tensor import QuantizedTensor
+    from transformer_engine.pytorch.tensor import MXFP8TensorStorage, NVFP4TensorStorage
+    from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Quantizer
+    from transformer_engine.pytorch.utils import (
+        nvtx_range_pop,
+        nvtx_range_push,
+        round_up_to_nearest_multiple,
+    )
+
+    HAVE_TE = True
+except (ImportError, ModuleNotFoundError):
+    # TE unavailable/too old -> stub the TE-backed names so this module still imports,
+    # and flag GTP unusable via HAVE_TE (gtp.py surfaces this as HAVE_GTP=False). No
+    # GTP path runs without TE. The `annotations` future-import keeps the lone
+    # module-level TE reference (a dataclass field annotation) from being evaluated.
+    from unittest.mock import MagicMock
+
+    te = tex = MagicMock()
+    MXFP8_BLOCK_SCALING_SIZE = NVFP4_BLOCK_SCALING_SIZE = None
+    _NVFP4AllGatherAsyncHandle = MagicMock()
+    gather_along_first_dim = reduce_scatter_along_first_dim = MagicMock()
+    in_fp8_activation_recompute_phase = MagicMock()
+    get_dummy_wgrad = MagicMock()
+    QuantizedTensor = MagicMock()
+    MXFP8TensorStorage = NVFP4TensorStorage = MagicMock()
+    MXFP8Quantizer = MagicMock()
+    nvtx_range_pop = nvtx_range_push = round_up_to_nearest_multiple = MagicMock()
+    HAVE_TE = False
+
+
+class GTPChain(str, Enum):
+    """Prefetch chain identifier for n GTPShardedParam.
+
+    GRAPHED   — fwd/bwd captured by a CUDA graph (MLM _CudaGraphRunner).
+    UNGRAPHED — fwd/bwd runs eagerly.
+
+    Chains never cross-link (prev_w/next_w stay within one chain). See
+    _classify_param_chain for the GRAPHED/UNGRAPHED rule.
+    """
+
+    GRAPHED = "GTP_graphed"
+    UNGRAPHED = "GTP_ungraphed"
+
+
+# Active cuda_graph config, set by the integrator via set_cuda_graph_modules() before
+# classify_gtp_chains(); consumed by _classify_param_chain.
+_CUDA_GRAPH_MODULES: Optional[set] = None  # scope tags, e.g. {"mamba","attn","moe_router"}
+_MOE_SHARED_EXPERT_OVERLAP: bool = False  # overlapped shared_experts can't be captured -> UNGRAPHED
+_FULL_ITERATION: bool = False  # whole step in one graph -> every param GRAPHED
+# Empty cuda_graph_modules under per-layer CG = "graph every layer" == all tags present.
+_ALL_LAYER_SCOPE_TAGS = frozenset({"mamba", "attn", "moe", "moe_router"})
+
+
+def set_cuda_graph_modules(
+    scope, moe_shared_expert_overlap: bool = False, cuda_graph_impl: str = "none"
+):
+    """Record the active cuda_graph config for GTP chain classification.
+
+    Called by MLM at init, before classify_gtp_chains(). ``cuda_graph_impl``
+    disambiguates the empty-``scope`` cases:
+      - "none"           -> CG disabled; all params UNGRAPHED.
+      - "full_iteration" -> whole step in one graph; all params GRAPHED.
+      - "local"/"transformer_engine" + empty scope -> graph every layer.
+    """
+    global _CUDA_GRAPH_MODULES, _MOE_SHARED_EXPERT_OVERLAP, _FULL_ITERATION
+    _MOE_SHARED_EXPERT_OVERLAP = bool(moe_shared_expert_overlap)
+    _FULL_ITERATION = cuda_graph_impl == "full_iteration"
+    if _FULL_ITERATION:
+        _CUDA_GRAPH_MODULES = None  # scope unused
+    elif cuda_graph_impl != "none" and not scope:
+        _CUDA_GRAPH_MODULES = set(_ALL_LAYER_SCOPE_TAGS)  # graph every layer
+    else:
+        _CUDA_GRAPH_MODULES = set(scope) if scope else None
+
+
+def _classify_param_chain(param_name: str) -> "GTPChain":
+    """Map a GTPShardedParam name + active cuda_graph config to its chain.
+
+    Full-iteration -> GRAPHED. Otherwise embedding/output_layer are UNGRAPHED, and
+    each layer kind (mixer, attention, shared/routed experts) is GRAPHED iff its
+    scope tag is in cuda_graph_modules.
+    """
+    n = param_name
+
+    if _FULL_ITERATION:
+        return GTPChain.GRAPHED
+
+    # embedding/output_layer live outside any per-layer CG runner.
+    if "embedding" in n or "output_layer" in n:
+        return GTPChain.UNGRAPHED
+
+    scope = _CUDA_GRAPH_MODULES
+    if not scope:  # CG disabled
+        return GTPChain.UNGRAPHED
+
+    if ".mlp.shared_experts." in n:
+        if _MOE_SHARED_EXPERT_OVERLAP:
+            return GTPChain.UNGRAPHED
+        return GTPChain.GRAPHED if ("moe" in scope or "moe_router" in scope) else GTPChain.UNGRAPHED
+
+    if ".mlp.experts." in n:
+        return GTPChain.GRAPHED if "moe" in scope else GTPChain.UNGRAPHED
+
+    if ".self_attention." in n or ".cross_attention." in n:
+        return GTPChain.GRAPHED if "attn" in scope else GTPChain.UNGRAPHED
+
+    if ".mixer." in n:
+        return GTPChain.GRAPHED if "mamba" in scope else GTPChain.UNGRAPHED
+
+    return GTPChain.UNGRAPHED
+
+
+def classify_gtp_chains(model) -> None:
+    """Walk model.named_parameters() and set chain_id on every GTPShardedParam.
+
+    Call once at init, AFTER set_cuda_graph_modules() and BEFORE the first fwd of any
+    graphed param. Raises if an already-initialized param would be reclassified into a
+    different chain (its prev/next links are already wired into the wrong list).
+    """
+    conflicts = []
+    for name, param in model.named_parameters():
+        if not isinstance(param, GTPShardedParam):
+            continue
+        target = _classify_param_chain(name).value
+        if param.prefetch_initialized and param.chain_id != target:
+            conflicts.append((name, param.chain_id, target))
+            continue
+        param.chain_id = target
+
+        # Bwd-prefetch opt-out: embedding weight needs no bwd AG (wgrad is a
+        # scatter-add on sharded rows, input has no dgrad) — saves one collective.
+        if "embedding" in name:
+            param._need_weight_prefetch_bwd = False
+    if conflicts:
+        raise RuntimeError(
+            "classify_gtp_chains: the following params were already chain-initialized "
+            "with a different chain_id than the classifier would assign — this means "
+            "their chain links are already wired into the wrong list. Move classification "
+            "earlier in init. Conflicts: "
+            + ", ".join(f"{n}: {old!r}->{new!r}" for n, old, new in conflicts[:3])
+            + ("..." if len(conflicts) > 3 else "")
+        )
+
+
+class GTPWeightState(Enum):
+    """State of a GTPShardedParam's AG / RS lifecycle (debug / stale-read guard)."""
+
+    NONE = "NONE"  # Sharded, no pending operation
+    ASYNC_WAIT = "ASYNC_WAIT"  # Async all-gather in progress
+    DATA_READY = "DATA_READY"  # Async all-gather complete, result in cache
+    DATA_READY_SYNC = "DATA_READY_SYNC"  # Sync all-gather complete, result in cache
+
+
+# Global GTP buffer cache (persists across clear(); never set to None after creation).
+_GTP_CACHE = None
+_GTP_PARAMS = []
+
+# Global set of GTPShardedParam with in-flight async comms (AG or RS).
+_inflight_comm_params: set = set()
+_AG_STREAMS: Dict[str, torch.cuda.Stream] = {}
+_RS_STREAMS: Dict[str, torch.cuda.Stream] = {}
+
+# Wgrad input buffer pool, keyed by (shape, dtype). UNGRAPHED-only: GRAPHED
+# wgrad bufs need address stability for CG replay and are not pool-recycled.
+_wgrad_buf_pool: Dict[tuple, list] = {}
+
+
+def _wgrad_pool_get(shape: tuple, dtype: torch.dtype, device) -> torch.Tensor:
+    """Get a pool buffer or allocate fresh, tagged so _wgrad_pool_put accepts only
+    pool-owned buffers (other callers fall through to the caching allocator on release)."""
+    key = (shape, dtype)
+    pool = _wgrad_buf_pool.get(key)
+    if pool:
+        buf = pool.pop()
+    else:
+        buf = torch.empty(shape, dtype=dtype, device=device, requires_grad=False)
+    buf._from_gtp_wgrad_pool = True
+    return buf
+
+
+def _wgrad_pool_put(buf: torch.Tensor):
+    """Return a pool-owned buffer for reuse (no-op for untagged buffers; see
+    _wgrad_pool_get)."""
+    if not getattr(buf, "_from_gtp_wgrad_pool", False):
+        return
+    key = (tuple(buf.shape), buf.dtype)
+    if key not in _wgrad_buf_pool:
+        _wgrad_buf_pool[key] = []
+    _wgrad_buf_pool[key].append(buf)
+
+
+def _stream_key(chain_id: str, group) -> tuple:
+    """Key for the per-(chain, group) AG/RS stream dicts.
+
+    Partitioned on two axes: chain_id (captured GRAPHED vs eager UNGRAPHED ops must not
+    share a stream) and group (independent NCCL comms, e.g. GTP vs EGTP, avoid serialization).
+    """
+    return (chain_id, id(group) if group is not None else 0)
+
+
+def get_ag_stream(chain_id: str = GTPChain.GRAPHED.value, group=None) -> torch.cuda.Stream:
+    """Return the GTP all-gather stream for (chain_id, group). See _stream_key."""
+    key = _stream_key(chain_id, group)
+    if key not in _AG_STREAMS:
+        _AG_STREAMS[key] = torch.cuda.Stream()
+    return _AG_STREAMS[key]
+
+
+def get_rs_stream(chain_id: str = GTPChain.GRAPHED.value, group=None) -> torch.cuda.Stream:
+    """Return the GTP reduce-scatter stream for (chain_id, group). See _stream_key."""
+    key = _stream_key(chain_id, group)
+    if key not in _RS_STREAMS:
+        _RS_STREAMS[key] = torch.cuda.Stream()
+    return _RS_STREAMS[key]
+
+
+def wait_for_gtp_grad_reduction_on_current_stream() -> None:
+    """Fence the current stream against all GTP backward grad work before the DP gradient sync.
+
+    Drains the eager AG/RS side streams, then waits on each CG runner's replay stream
+    (its tail = captured Phase 2 main_grad.add_). No-op when GTP is inactive.
+    """
+    wait_async_comms()
+    cur = torch.cuda.current_stream()
+    for s in _AG_STREAMS.values():
+        cur.wait_stream(s)
+    for s in _RS_STREAMS.values():
+        cur.wait_stream(s)
+    # Local import: cuda_graphs imports this module, so a module-level import would be circular.
+    from megatron.core.transformer.cuda_graphs import get_gtp_runner_streams
+
+    for s in get_gtp_runner_streams():
+        cur.wait_stream(s)
+
+
+@dataclass
+class GTPConfig:
+    """Global configuration for Generalized Tensor Parallelism."""
+
+    pad_for_alignment: int = 16
+    check_param_states: bool = False
+    weight_prefetch: bool = True
+    # True (default): non-chain-head wgrad RS is async_op=True and finalizes
+    # (handle.wait + main_grad.add_) in a later bwd's cascade walk, overlapping RS with
+    # compute. False: every wgrad RS is synchronous + inline (no overlap).
+    async_reduction: bool = True
+    # GTP companion to --fp8-param-gather: optimizer casts FP32 master directly into
+    # GTPShardedParam.quantized; forward reuses the cached FP8 (BF16->FP8 off critical path).
+    fp8_param_gather: bool = False
+
+
+GTP_CONFIG = GTPConfig()
+
+
+def update_gtp_config(**kwargs):
+    """Update the global GTP configuration."""
+    for key, value in kwargs.items():
+        if not hasattr(GTP_CONFIG, key):
+            raise ValueError(f"Unknown GTP config option: {key}")
+        setattr(GTP_CONFIG, key, value)
+
+
+def tag_gtp_params_with_names(model):
+    """Populate _debug_name on every GTPShardedParam with its full dotted parameter name.
+
+    Call once after model construction so the linking log prints human-readable names
+    instead of raw tensor ids.
+    """
+    for name, param in model.named_parameters():
+        if isinstance(param, GTPShardedParam):
+            param._debug_name = name
+
+
+def _gtp_slice_one_param(param, gtp_group, *, name="<unnamed>"):
+    """Pad + slice a full-size BF16 weight to this rank's GTP shard.
+
+    Caller attaches GTP attrs (see _gtp_attach_attrs). On the legacy post-init path under
+    fp8_model_init, tensor may be a QuantizedTensor — F.pad dequantizes it before slicing.
+    """
+    gtp_size = gtp_group.size()
+    gtp_rank = gtp_group.rank()
+    tensor = param.data
+
+    if GTP_CONFIG.pad_for_alignment > 0:
+        # Pad before slicing so shards stay alignment-divisible and padding
+        # ends up contiguous at the tail of the gathered result.
+        alignment = GTP_CONFIG.pad_for_alignment * gtp_size
+        dim0 = tensor.shape[0]
+        pad_length = (alignment - dim0 % alignment) % alignment
+        if pad_length > 0:
+            tensor = torch.nn.functional.pad(tensor, (0, 0, 0, pad_length))
+    else:
+        # No-pad mode: dim-0 must divide gtp_size or AG output loses tail rows.
+        assert tensor.shape[0] % gtp_size == 0, (
+            f"_gtp_slice_one_param: {name}.shape[0]={tensor.shape[0]} is not "
+            f"divisible by gtp_size={gtp_size}. Either enable padding by "
+            "setting GTP_CONFIG.pad_for_alignment > 0, or ensure the weight's "
+            "dim-0 is a multiple of the GTP group size."
+        )
+        pad_length = 0
+
+    shard_size = tensor.shape[0] // gtp_size
+    shard = tensor[gtp_rank * shard_size : (gtp_rank + 1) * shard_size]
+    gtp_shard = GTPShardedParam(shard.clone())
+    gtp_shard.pad_length = pad_length
+    # Preserve the source weight's TP attributes (dropped when wrapping into GTPShardedParam),
+    # so param_is_not_tensor_parallel_duplicate still classifies it without GTP-specific code.
+    from megatron.core.tensor_parallel import copy_tensor_model_parallel_attributes
+
+    copy_tensor_model_parallel_attributes(gtp_shard, param)
+    return gtp_shard
+
+
+def _gtp_attach_attrs(gtp_shard, gtp_group, *, is_grouped=False, expert_idx=0):
+    """Attach group / gtp_size / routed-expert tags and register in _GTP_PARAMS.
+
+    Separate from _gtp_slice_one_param so attrs land on the post-quantize param (when
+    quantize fires between slice and attach).
+    """
+    if is_grouped:
+        gtp_shard.expert_idx = expert_idx
+        gtp_shard.is_routed_expert = True
+        # Default to UNGRAPHED; classify_gtp_chains() reclassifies based on the
+        # cuda_graph_modules at init time.
+        gtp_shard.chain_id = GTPChain.UNGRAPHED.value
+    gtp_shard.group = gtp_group
+    gtp_shard.gtp_size = gtp_group.size()
+    global _GTP_PARAMS
+    _GTP_PARAMS.append(gtp_shard)
+
+
+def wrap_module_params_gtp(module, weight_names, gtp_group, is_grouped=None):
+    """Shard and re-register module params as GTPShardedParam.
+
+    Two call paths: (1) Megatron-style modules (ColumnParallelLinear, etc.) — full post-init
+    slice; (2) TE modules — per-param body no-ops, since the reset_parameters hook already
+    produced GTPShardedParam instances.
+    """
+    if gtp_group.size() == 1:
+        return
+
+    for idx, name in enumerate(weight_names):
+        param = getattr(module, name, None)
+        if param is None:
+            continue
+
+        # TE-side hook already sliced this one.
+        if isinstance(param, GTPShardedParam):
+            continue
+
+        # delete the original parameter, which will be replaced by an GTP sharded one
+        delattr(module, name)
+        gtp_shard = _gtp_slice_one_param(param, gtp_group, name=name)
+        del param
+        _gtp_attach_attrs(gtp_shard, gtp_group, is_grouped=bool(is_grouped), expert_idx=idx)
+        # register the newly sharded param back to the module
+        module._parameters[name] = gtp_shard
+
+    if is_grouped:
+        allweights = [getattr(module, name) for name in weight_names]
+        allweights[0].weight_list = allweights
+
+
+def gtp_slice_in_reset_parameters(module, name, param, expert_idx=0):
+    """Slice + attach attrs for one param, between init_fn(param) and the optional
+    quantizer(param) in TransformerEngineBaseModule.reset_parameters.
+
+    Only fires for params in module.weight_names (GEMM weights); layer-norm gammas, biases,
+    etc. stay full-size. Returns the new GTPShardedParam, or None (GTP not active here).
+    """
+    gtp_group = getattr(module, "_gtp_group", None)
+    if gtp_group is None or gtp_group.size() == 1:
+        return None
+    weight_names = getattr(module, "weight_names", None)
+    if weight_names is None or name not in weight_names:
+        return None
+    is_grouped = bool(getattr(module, "_gtp_is_grouped", False))
+    gtp_shard = _gtp_slice_one_param(param, gtp_group, name=name)
+    _gtp_attach_attrs(gtp_shard, gtp_group, is_grouped=is_grouped, expert_idx=expert_idx)
+    return gtp_shard
+
+
+def gtp_finalize_module_in_reset_parameters(module, weight_names):
+    """GroupedLinear-only: attach weight_list to expert 0's shard for batched all-gather
+    (no-op when module._gtp_is_grouped is False)."""
+    if not getattr(module, "_gtp_is_grouped", False):
+        return
+    gtp_group = getattr(module, "_gtp_group", None)
+    if gtp_group is None or gtp_group.size() == 1:
+        return
+    allweights = [getattr(module, n) for n in weight_names]
+    if allweights:
+        allweights[0].weight_list = allweights
+
+
+class GTPShardHandle:
+    """Wrapper around a ``dist`` async-work handle for a GTP AG / RS.
+
+    Tracks the participating shards so the wait-site can transition their GTPWeightState
+    and prune the param from _inflight_comm_params when the collective completes.
+    """
+
+    def __init__(self, handle, gtp_shards, reduce_scatter=False):
+        self.handle = handle
+        self.gtp_shards = gtp_shards
+        self.reduce_scatter = reduce_scatter
+        _inflight_comm_params.add(gtp_shards[0])
+
+    def wait(self):
+        """Wait on the underlying NCCL work and update the shards' state."""
+        if self.handle is not None:
+            self.handle.wait()
+            self.handle = None  # Release NCCL Work and its C++ tensor references promptly
+        if GTP_CONFIG.check_param_states:
+            for w in self.gtp_shards:
+                if self.reduce_scatter:
+                    w._set_rs_state(GTPWeightState.DATA_READY)
+                else:
+                    w._set_state(GTPWeightState.DATA_READY)
+
+        _inflight_comm_params.discard(self.gtp_shards[0])
+
+
+class GTPShardedParam(torch.nn.Parameter):
+    """A weight parameter sharded 1/N across a GTP process group.
+
+    Materialized on-demand via async all-gather and gradient-reduced via reduce-scatter.
+    Carries its own prefetch-chain wiring (prev_w/next_w), per-chain state, AG/RS cache
+    tickets, and the metadata the integrator needs to overlap with captured compute.
+    """
+
+    # Per-chain linked-list state, keyed by chain_id; chains never cross-link (prev_w/next_w join
+    # only same-chain params). Call reset_gtp_state() before rebuilding a GTP model in-process.
+    _chain_state: Dict[str, dict] = {}
+
+    # Recompute-forward prefetch cursor, keyed by chain_id; also cleared by reset_gtp_state().
+    _recompute_chain_state: Dict[str, dict] = {}
+
+    @classmethod
+    def _get_chain_state(cls, chain_id: str) -> dict:
+        if chain_id not in cls._chain_state:
+            cls._chain_state[chain_id] = {
+                "last_weight": None,
+                "link_node_count": 0,
+                "link_table_buffer": [],
+                "link_table_flushed": False,
+            }
+        return cls._chain_state[chain_id]
+
+    @classmethod
+    def _get_recompute_chain_state(cls, chain_id: str) -> dict:
+        if chain_id not in cls._recompute_chain_state:
+            cls._recompute_chain_state[chain_id] = {"last_weight": None}
+        return cls._recompute_chain_state[chain_id]
+
+    @classmethod
+    def _buffer_link_table_row(
+        cls, prev: "GTPShardedParam", curr: "GTPShardedParam", chain: dict
+    ) -> None:
+        """Buffer one prefetch-link row (flushed atomically on the second forward pass)."""
+        _W = 70
+
+        def _layer_id(name: str) -> str:
+            m = re.search(r"\d+", name)
+            return m.group() if m else "-"
+
+        chain["link_node_count"] += 1
+        if chain["link_node_count"] == 1:
+            chain_id = getattr(curr, "chain_id", GTPChain.UNGRAPHED.value)
+            chain["link_table_buffer"].append(
+                f"\n[{chain_id} chain]\n{'node_id':>7} | {'layer_id':>8} |"
+                f" {'curr_weight_name':<{_W}} |"
+                f" prev_weight_name\n{'-'*7}-+-{'-'*8}-+-{'-'*_W}-+-{'-'*_W}"
+            )
+            # Seed weight (first GTP param) as row 0
+            chain["link_table_buffer"].append(
+                f"{'0':>7} | {_layer_id(prev._debug_name):>8} | {prev._debug_name:<{_W}} | -"
+            )
+        chain["link_table_buffer"].append(
+            f"{chain['link_node_count']:>7} | {_layer_id(curr._debug_name):>8} | "
+            f"{curr._debug_name:<{_W}} | {prev._debug_name}"
+        )
+
+    @staticmethod
+    def __new__(cls, tensor, *args, **kwargs):  # pylint: disable=unused-argument
+        requires_grad = kwargs.get("requires_grad", True)
+        # pylint: disable-next=unexpected-keyword-arg
+        return super(GTPShardedParam, cls).__new__(cls, tensor, requires_grad=requires_grad)
+
+    def __init__(self, tensor, *args, **kwargs):
+        del tensor, args, kwargs
+        super().__init__()
+
+        # Canonical flag — also set on distopt's main_param copy so both kinds
+        # of param can be classified via a single attribute check.
+        self.is_gtp = True
+
+        # all gather
+        self.state = GTPWeightState.NONE
+        self._ag_ticket_fwd = None
+        self._ag_ticket_bwd = None
+        self._prefetch_handle = None
+        self._need_weight_prefetch = True
+        # Per-direction prefetch opt-outs (default True). The embedding weight needs no bwd AG
+        # (wgrad is a token-indexed scatter-add, input non-differentiable). classify_gtp_chains()
+        # sets this False for embedding.word_embeddings.weight.
+        self._need_weight_prefetch_bwd = True
+        self.ag_event = torch.cuda.Event(external=True)
+        # DDP backward hook (set by register_grad_accum_hook); invoked after
+        # the wgrad RS accumulation completes (Graphed.backward / chain cascade).
+        self._grad_accum_hook = None
+        # Quantization
+        self._quantizer = None
+        self.did_cast_to_low_precision = False
+        self.quantized = None
+        # Prefetching linked list
+        self.prefetch_initialized = False
+        self.next_w = None
+        self.prev_w = None
+        # Recompute-forward prefetch chain: a SEPARATE chain (own slot) for weights re-gathered
+        # rowwise during an activation-recompute forward in backward. Distinct from the
+        # state/_prefetch_handle/ag_event above so it never clobbers the concurrent columnwise
+        # dgrad lifecycle. Self-populates from the first backward's recompute gathers.
+        self._recompute_initialized = False
+        self._recompute_next = None
+        self._recompute_prev = None
+        self._recompute_prefetch_handle = None
+        self._recompute_ag_event = torch.cuda.Event(external=True)
+        self._recompute_already_drained = False
+        # Chain identity (GRAPHED/UNGRAPHED). Defaults to UNGRAPHED; classify_gtp_chains(model)
+        # walks the model at init (after set_cuda_graph_modules) and reclassifies on param name +
+        # active cuda_graph_modules.
+        self.chain_id = GTPChain.UNGRAPHED.value
+        # Grouped gemm
+        self.is_routed_expert = False
+        self.expert_idx = None
+        self.group = None
+        self.weight_list = None
+        # Reduce-scatter state (set during wgrad_reduce_scatter)
+        self.rs_state = GTPWeightState.NONE
+        self._wgrad_rs_handle = None
+        self.rs_event = torch.cuda.Event(external=True)
+        self._rs_ticket = None
+        # Padding
+        self.pad_length = 0
+        # Debug
+        self._debug_name = ""
+        # Hot-path caches (populated lazily on first use).  chain_id/group are
+        # set after __init__, so we can't resolve streams eagerly here.
+        self._cached_ag_stream = None
+        self._cached_rs_stream = None
+        self._cached_quantizers = None
+        self._cached_dtypes = None
+        self._cached_gtp_group = None
+
+    def setup(self, weight_quantizer=None):
+        """Set quantizer and create quantized shard."""
+
+        if self._quantizer is None:
+
+            def _configure_quantizer(q, group):
+                q = q.copy()
+                if hasattr(q, "with_amax_reduction"):
+                    q.with_amax_reduction = True
+                    q.amax_reduction_group = group
+                q.internal = False
+                # MXFP8 scales must stay compact (unswizzled) so per-shard scale_inv can be
+                # all-gathered by byte concatenation. GEMM-swizzled scales from independent
+                # shards don't compose into a valid swizzled layout for the full tensor.
+                q.optimize_for_gemm = not isinstance(q, MXFP8Quantizer)
+                return q
+
+            weights = (
+                self.weight_list
+                if self.is_routed_expert and self.weight_list is not None
+                else [self]
+            )
+            for quantizer, weight in zip(weight_quantizer, weights):
+                if quantizer is None:
+                    continue
+
+                weight._quantizer = _configure_quantizer(quantizer, weight.group)
+                # This init quantize is the only allocation of the quantized storage
+                # (re-quantize writes in place), so route it via _graphed_alloc.
+                with _graphed_alloc(getattr(weight, "chain_id", GTPChain.UNGRAPHED.value)):
+                    weight.quantized = weight._quantizer.quantize(weight.get_padded_shard())
+                weight.quantized.is_routed_expert = getattr(weight, "is_routed_expert", False)
+                # fp8_param_gather: the init quantize already produced a valid FP8 cache from
+                # the BF16 shard; flag did_cast so iter-0 forward short-circuits and skips the
+                # redundant BF16->FP8 cast.
+                if GTP_CONFIG.fp8_param_gather:
+                    weight.did_cast_to_low_precision = True
+
+    @property
+    def _weights(self):
+        """Individual weight shards (self for non-routed, weight_list for routed)."""
+        weights = self.weight_list if self.is_routed_expert else [self]
+        # Only meaningful when _set_state is actively tracking transitions.
+        if GTP_CONFIG.check_param_states:
+            assert all(w.state == weights[0].state for w in weights)
+        return list(weights)
+
+    @property
+    def _unsharded_shape_padded(self):
+        """Full unsharded shape *including* the pad rows on the last rank."""
+        out_shape = list(self.size())
+        out_shape[0] = out_shape[0] * self.group.size()
+        return tuple(out_shape)
+
+    @property
+    def _unsharded_shape(self):
+        """Full unsharded shape with the pad rows stripped (logical shape)."""
+        out_shape = list(self._unsharded_shape_padded)
+        out_shape[0] -= self.pad_length
+        return tuple(out_shape)
+
+    @property
+    def _sharded_padded_shape(self):
+        """This rank's local shard shape, padding included."""
+        return tuple(self.size())
+
+    def get_padded_shard(self):
+        """Return the local shard already containing its share of padding (identity)."""
+        return self
+
+    def _set_state(self, new_state: GTPWeightState):
+        """Advance the AG state (only inspected when ``check_param_states`` is on)."""
+        # Only inspected when check_param_states is on; skip writes otherwise.
+        if not GTP_CONFIG.check_param_states:
+            return
+        self.state = new_state
+
+    def _set_rs_state(self, new_state: GTPWeightState):
+        """Advance the RS state (only inspected when ``check_param_states`` is on)."""
+        if not GTP_CONFIG.check_param_states:
+            return
+        self.rs_state = new_state
+
+    def _get_cache_key(self, dtype, fwd: bool, reduce_scatter: bool) -> tuple:
+        """Build cache key from output shape + dtype.
+
+        Weights with matching gathered shape and dtype share a buffer. For experts gathered
+        in parallel, self.expert_idx keeps each distinct; same-indexed experts across layers share.
+        """
+
+        if not isinstance(dtype, torch.dtype):
+            return (
+                self._unsharded_shape_padded,
+                dtype,
+                fwd,
+                not fwd,
+                self.expert_idx,
+                reduce_scatter,
+            )
+        return (self._unsharded_shape_padded, dtype, self.expert_idx, reduce_scatter)
+
+    def _quantize_if_needed(self, skip_weight_cast=False, cast_noop_flag=None):
+        """Re-quantize sharded weight into existing buffer. Returns quantized weight or self."""
+        if self._quantizer is None:
+            self.did_cast_to_low_precision = False
+            return self
+
+        # fp8_param_gather fast-path: optimizer already filled self.quantized;
+        # reuse it and keep BF16->FP8 off the forward critical path.
+        if GTP_CONFIG.fp8_param_gather and self.did_cast_to_low_precision:
+            return self.quantized
+
+        self._quantizer.set_usage(rowwise=True, columnwise=True)
+        if skip_weight_cast is False or cast_noop_flag is not None:
+            tex.quantize(
+                tensor=self.get_padded_shard(),
+                quantizer=self._quantizer,
+                output=self.quantized,
+                noop=cast_noop_flag,
+            )
+        self.did_cast_to_low_precision = True
+
+        return self.quantized
+
+    def _strip_padding(self, tensor):
+        if self.pad_length == 0:
+            return tensor
+
+        if isinstance(tensor, QuantizedTensor):
+            assert isinstance(
+                tensor, (NVFP4TensorStorage, MXFP8TensorStorage)
+            ), f"Unsupported quantized tensor type for GTP padding: {type(tensor)}"
+
+            metadata = tensor.get_metadata()
+            if metadata.get("rowwise_data") is not None:
+                metadata["rowwise_data"] = metadata["rowwise_data"][: -self.pad_length]
+            if metadata.get("columnwise_data") is not None:
+                if isinstance(tensor, NVFP4TensorStorage):
+                    # NVFP4 transposes columnwise and packs 2 values per byte
+                    metadata["columnwise_data"] = metadata["columnwise_data"][
+                        ..., : -self.pad_length // 2
+                    ].contiguous()
+                else:
+                    # MXFP8 columnwise is not transposed, strip first dim
+                    metadata["columnwise_data"] = metadata["columnwise_data"][: -self.pad_length]
+            M = self._unsharded_shape[0]
+            if isinstance(tensor, NVFP4TensorStorage):
+                # NVFP4 scale_inv shapes (see NVFP4Quantizer.get_scale_shape):
+                #   rowwise_scale_inv:    [round_up(M, 128),  round_up(ceil(K/16), 4)]
+                #   columnwise_scale_inv: [round_up(K, 128),  round_up(ceil(M/16), 4)]
+                # GTP shards M (dim 0 of the weight), so strip to the unpadded sizes.
+                if metadata.get("rowwise_scale_inv") is not None:
+                    m_rows = round_up_to_nearest_multiple(M, 128)
+                    metadata["rowwise_scale_inv"] = metadata["rowwise_scale_inv"][:m_rows]
+                if metadata.get("columnwise_scale_inv") is not None:
+                    m_tiles = round_up_to_nearest_multiple(
+                        math.ceil(M / NVFP4_BLOCK_SCALING_SIZE), 4
+                    )
+                    metadata["columnwise_scale_inv"] = metadata["columnwise_scale_inv"][
+                        :, :m_tiles
+                    ].contiguous()
+            else:
+                # MXFP8 scale_inv shapes (see MXFP8Quantizer.get_scale_shape):
+                #   rowwise_scale_inv:    [round_up(M, 128),     round_up(K//32, 4)]
+                #   columnwise_scale_inv: [round_up(M//32, 4),   round_up(K, 128)]
+                # GTP shards M (dim 0 of the weight), so strip to the unpadded sizes.
+                if metadata.get("rowwise_scale_inv") is not None:
+                    m_rows = round_up_to_nearest_multiple(M, 128)
+                    metadata["rowwise_scale_inv"] = metadata["rowwise_scale_inv"][:m_rows]
+                if metadata.get("columnwise_scale_inv") is not None:
+                    m_tiles = round_up_to_nearest_multiple(M // MXFP8_BLOCK_SCALING_SIZE, 4)
+                    metadata["columnwise_scale_inv"] = metadata["columnwise_scale_inv"][:m_tiles]
+
+            return type(tensor)(**metadata, shape=self._unsharded_shape, dtype=torch.bfloat16)
+
+        return tensor[: -self.pad_length]
+
+    def _all_gather_weight(self, async_op, skip_weight_cast, cast_noop_flag, fwd, nvtx_label=None):
+        """Quantize (if needed) and all-gather weight. Returns (weight_total, handle)."""
+        if nvtx_label is None:
+            nvtx_label = (
+                self._debug_name + (".fwd" if fwd else ".bwd") + (".async" if async_op else ".sync")
+            )
+        nvtx_range_push(f"{nvtx_label}.all_gather_weight")
+
+        weights = self._weights
+
+        # 1. Transition state for async gathers. Skip during recompute-forward: it gathers
+        #    rowwise (_ag_ticket_fwd) while a bwd-chain prefetch may hold an in-flight columnwise
+        #    AG state (_ag_ticket_bwd) on the same weight — clobbering breaks the dgrad consume.
+        if GTP_CONFIG.check_param_states and not in_fp8_activation_recompute_phase():
+            new_state = GTPWeightState.ASYNC_WAIT if async_op else GTPWeightState.DATA_READY_SYNC
+            for w in weights:
+                w._set_state(new_state)
+
+        # 2. Prepare: quantize, set usage direction.
+        fp8_pg_hit = GTP_CONFIG.fp8_param_gather and self.did_cast_to_low_precision
+
+        if not fp8_pg_hit:
+            for w in weights:
+                w._quantize_if_needed(skip_weight_cast, cast_noop_flag)
+
+        for w in weights:
+            if w.did_cast_to_low_precision:
+                w._quantizer.set_usage(rowwise=fwd, columnwise=not fwd)
+
+        # 3. Build gather inputs.
+        # quantizers / dtypes / gtp_group are stable post-construction — cache on the anchor
+        # (self == weights[0]) to avoid rebuilding lists each call. w.quantized is NOT cached
+        # (it can rebind).
+        quantizers = self._cached_quantizers
+        if quantizers is None:
+            quantizers = [w._quantizer for w in weights]
+            self._cached_quantizers = quantizers
+        if weights[0].did_cast_to_low_precision:
+            gather_weights = [w.quantized for w in weights]
+        else:
+            gather_weights = list(w.get_padded_shard() for w in weights)
+
+        # 4. Cache checkout — use pooled buffers for both async and sync gathers
+        #    to avoid allocating fresh memory each iteration.
+        dtypes = self._cached_dtypes
+        if dtypes is None:
+            dtypes = [q.dtype if q is not None else w.dtype for q, w in zip(quantizers, weights)]
+            self._cached_dtypes = dtypes
+        out_buffers = []
+        cache = get_global_GTP_cache()
+        for p, dt in zip(weights, dtypes):
+            if fwd:
+                if p._ag_ticket_fwd is None:
+                    p._ag_ticket_fwd = cache.reserve(p, dt, fwd=True)
+                    cache.get(p._ag_ticket_fwd)
+                    cache.release(p._ag_ticket_fwd)
+                out_buffers.append(cache.get(p._ag_ticket_fwd))
+            else:
+                if p._ag_ticket_bwd is None:
+                    p._ag_ticket_bwd = cache.reserve(p, dt, fwd=False)
+                out_buffers.append(cache.get(p._ag_ticket_bwd))
+
+        # 5. Communicate.
+        gtp_group = self._cached_gtp_group
+        if gtp_group is None:
+            gtp_group = weights[0].group
+            self._cached_gtp_group = gtp_group
+        if GTP_CONFIG.check_param_states and len(gather_weights) > 1:
+            # Debug invariant: batched AG needs distinct output buffers per expert.
+            assert len(set(id(b) for b in out_buffers)) == len(
+                out_buffers
+            ), "Duplicate output buffers in batched all-gather — experts need distinct cache keys"
+
+        # ASYNC AG: issue on ag_stream so its tail reflects the collective's full lifecycle
+        # (what external wait_stream(ag_stream) drains depend on). The explicit outer→ag_stream
+        # sync event preserves the upstream quantize-writer edge the bare stream context drops;
+        # held on self so the event pool can't recycle it between capture and replay.
+        # SYNC AG: stay on caller — output ready on return.
+        if async_op:
+            outer_stream = torch.cuda.current_stream()
+            ag_stream = get_ag_stream(self.chain_id, gtp_group)
+            if getattr(self, "_ag_outer_sync_event", None) is None:
+                self._ag_outer_sync_event = torch.cuda.Event()
+            outer_sync_event = self._ag_outer_sync_event
+            outer_sync_event.record(outer_stream)
+            ag_stream.wait_event(outer_sync_event)
+            ag_ctx = torch.cuda.stream(ag_stream)
+        else:
+            ag_ctx = nullcontext()
+
+        with ag_ctx:
+            if len(gather_weights) > 1:
+                nvtx_range_push(f"{nvtx_label}.batched_gtp_ag")
+                results, handle = grouped_gather_along_first_dim(
+                    gather_weights,
+                    gtp_group,
+                    async_op=async_op,
+                    quantizers=quantizers,
+                    output_tensors=out_buffers,
+                )
+                nvtx_range_pop(f"{nvtx_label}.batched_gtp_ag")
+            else:
+                nvtx_range_push(f"{nvtx_label}.gtp_ag")
+                weight_total, handle = gather_along_first_dim(
+                    gather_weights[0],
+                    gtp_group,
+                    quantizer=quantizers[0],
+                    async_op=async_op,
+                    output_tensor=out_buffers[0] if out_buffers is not None else None,
+                )
+                nvtx_range_pop(f"{nvtx_label}.gtp_ag")
+                results = [weight_total]
+
+        result = results if self.is_routed_expert else results[0]
+
+        # 6. Wrap handle.
+        if async_op:
+            handle = GTPShardHandle(handle, weights)
+        else:
+            handle = None
+
+        nvtx_range_pop(f"{nvtx_label}.all_gather_weight")
+        return result, handle
+
+    def _wait_param_gather(self):
+        # Enter ag_stream context so handle.wait() + ag_event.record() both
+        # land on ag_stream. That makes ag_event mark ag_stream's tail, which
+        # is what external drains via wait_stream(ag_stream) actually block on.
+        ag_stream = self._cached_ag_stream
+        if ag_stream is None:
+            ag_stream = get_ag_stream(self.chain_id, self.group)
+            self._cached_ag_stream = ag_stream
+        with torch.cuda.stream(ag_stream):
+            if self._prefetch_handle is not None:
+                self._prefetch_handle.wait()
+                self._prefetch_handle = None
+                self.ag_event.record()
+
+    def _all_gather_weight_on_demand(self, fwd, skip_weight_cast=False, cast_noop_flag=None):
+        result, _ = self._all_gather_weight(
+            async_op=False,
+            skip_weight_cast=skip_weight_cast,
+            cast_noop_flag=cast_noop_flag,
+            fwd=fwd,
+        )
+        result = result if self.is_routed_expert else [result]
+        result = [self._strip_padding(r) for r in result]
+        result = [r.detach().requires_grad_(w.requires_grad) for r, w in zip(result, self._weights)]
+        return result if self.is_routed_expert else result[0]
+
+    def _get_prefetched_weight(self, fwd, skip_weight_cast=False, cast_noop_flag=None):
+        # ``skip_weight_cast`` and ``cast_noop_flag`` are accepted to keep the
+        # signature symmetric with ``_all_gather_weight_on_demand``.
+        del skip_weight_cast, cast_noop_flag
+        # Stale-read guard: state must reflect an AG issued for this cycle;
+        # otherwise cache.get() would return the prior iter's AG buffer.
+        if GTP_CONFIG.check_param_states:
+            for w in self._weights:
+                assert w.state in (
+                    GTPWeightState.ASYNC_WAIT,
+                    GTPWeightState.DATA_READY,
+                    GTPWeightState.DATA_READY_SYNC,
+                ), (
+                    f"[GTP] _get_prefetched_weight({'fwd' if fwd else 'bwd'}) on "
+                    f"{self._debug_name} with state={w.state!r} — no AG issued; "
+                    "cache.get() would return stale data. Check the chain's "
+                    "_need_weight_prefetch flag and issuer's prefetch logic."
+                )
+        _was_drained = getattr(self, "_already_ag_drained", False)
+        if _was_drained:
+            # Producer already drained via wait_async_comms; skip the captured cross-graph
+            # wait (a CUDA no-op anyway). Correctness comes from the eager main_stream sync.
+            self._already_ag_drained = False
+        else:
+            # Intra-graph or eager consume: drain inline.
+            self._wait_param_gather()
+            self.ag_event.wait()
+
+        # Retrieve prefetched results from cache
+        result = []
+        cache = get_global_GTP_cache()
+        for w in self._weights:
+            ticket = w._ag_ticket_fwd if fwd else w._ag_ticket_bwd
+            result.append(cache.get(ticket))
+
+        result = [self._strip_padding(r) for r in result]
+
+        result = [r.detach().requires_grad_(w.requires_grad) for r, w in zip(result, self._weights)]
+        return result if self.is_routed_expert else result[0]
+
+    def _wait_recompute_param_gather(self):
+        # Recompute-chain analogue of _wait_param_gather, on the _recompute_* slot.
+        ag_stream = self._cached_ag_stream
+        if ag_stream is None:
+            ag_stream = get_ag_stream(self.chain_id, self.group)
+            self._cached_ag_stream = ag_stream
+        with torch.cuda.stream(ag_stream):
+            if self._recompute_prefetch_handle is not None:
+                self._recompute_prefetch_handle.wait()
+                self._recompute_prefetch_handle = None
+                self._recompute_ag_event.record()
+
+    def _recompute_prefetch_next(self, target, nvtx_label=None):
+        # Issue target's rowwise (fwd) AG into its recompute slot. _all_gather_weight skips the
+        # AG-state transition under recompute, so target's dgrad state is untouched; result lands
+        # in target._ag_ticket_fwd.
+        _, handle = target._all_gather_weight(
+            async_op=True,
+            skip_weight_cast=True,
+            cast_noop_flag=None,
+            fwd=True,
+            nvtx_label=nvtx_label,
+        )
+        target._recompute_prefetch_handle = handle
+
+    def _get_recompute_prefetched_weight(self):
+        # Recompute-chain analogue of _get_prefetched_weight (state-neutral; reads the
+        # rowwise _ag_ticket_fwd via the _recompute_* slot).
+        if self._recompute_already_drained:
+            # Producer already drained via wait_async_comms (CG capture); skip the
+            # captured cross-graph wait (CUDA no-op anyway).
+            self._recompute_already_drained = False
+        else:
+            self._wait_recompute_param_gather()
+            self._recompute_ag_event.wait()
+
+        result = []
+        cache = get_global_GTP_cache()
+        for w in self._weights:
+            result.append(cache.get(w._ag_ticket_fwd))
+        result = [self._strip_padding(r) for r in result]
+        result = [r.detach().requires_grad_(w.requires_grad) for r, w in zip(result, self._weights)]
+        return result if self.is_routed_expert else result[0]
+
+    def all_gather_and_prefetch_bwd(self, nvtx_label=None):
+        """Backward variant: get the current weight (cached if prefetched, else sync gather)
+        and async-prefetch prev_w.
+
+        Safe via the coat-check cache: get() returns the current buffer to the pool, and the
+        prefetch's checkout allocates a separate buffer if the pool is empty (current buffer
+        still live via the caller's reference).
+
+        Returns:
+            weight_total
+        """
+
+        if GTP_CONFIG.weight_prefetch and self.next_w is not None:
+            result = self._get_prefetched_weight(False, skip_weight_cast=True)
+        else:
+            result = self._all_gather_weight_on_demand(False, skip_weight_cast=True)
+
+        if (
+            GTP_CONFIG.weight_prefetch
+            and self.prev_w is not None
+            and self.prev_w._need_weight_prefetch
+            and self.prev_w._need_weight_prefetch_bwd
+        ):
+            # Pre-AG work (quantize, ticket lookup) runs on caller's stream; the NCCL collective
+            # is wrapped on ag_stream inside _all_gather_weight (see its async/sync gate).
+            _, handle = self.prev_w._all_gather_weight(
+                async_op=True,
+                skip_weight_cast=True,
+                cast_noop_flag=None,
+                fwd=False,
+                nvtx_label=nvtx_label,
+            )
+            self.prev_w._prefetch_handle = handle
+
+        # The unsharded tensor has been returned, no pending work so reset state to NONE
+        if GTP_CONFIG.check_param_states:
+            for w in self._weights:
+                w._set_state(GTPWeightState.NONE)
+
+        if GTP_CONFIG.weight_prefetch and self.next_w is not None:
+            cache = get_global_GTP_cache()
+            for w in self._weights:
+                cache.release(w._ag_ticket_bwd)
+
+        return result
+
+    def batched_all_gather_and_prefetch_bwd(self, nvtx_label=None):
+        """Batched backward all-gather + prefetch. Wrapper around all_gather_and_prefetch_bwd."""
+        assert self.is_routed_expert and self.weight_list is not None
+        return self.all_gather_and_prefetch_bwd(nvtx_label=nvtx_label)
+
+    def all_gather_and_prefetch(
+        self,
+        fwd: bool = True,
+        skip_weight_cast: bool = False,
+        cast_noop_flag: torch.Tensor = None,
+        nvtx_label: str = None,
+    ):
+        """All-gather the current weight and async-prefetch the next.
+
+        Returns:
+            weight_total
+        """
+        # During an activation-recompute forward (runs in backward), route consume +
+        # prefetch through the recompute-forward chain on its own _recompute_* slot
+        # (see __init__) instead of the fwd/bwd chains; lazy-built below.
+        in_recompute = in_fp8_activation_recompute_phase()
+        use_recompute_chain = in_recompute and GTP_CONFIG.weight_prefetch
+
+        # Consume current weight.
+        if use_recompute_chain and self._recompute_prev is not None:
+            result = self._get_recompute_prefetched_weight()
+        elif not in_recompute and GTP_CONFIG.weight_prefetch and self.prev_w is not None:
+            result = self._get_prefetched_weight(True, skip_weight_cast, cast_noop_flag)
+        else:
+            # On-demand: chain head (fwd or recompute global-first) or first-iter build.
+            result = self._all_gather_weight_on_demand(True, skip_weight_cast, cast_noop_flag)
+
+        # Prefetch next weight on the matching chain.
+        if (
+            use_recompute_chain
+            and self._recompute_next is not None
+            and self._recompute_next._need_weight_prefetch
+        ):
+            self._recompute_prefetch_next(self._recompute_next, nvtx_label=nvtx_label)
+        elif (
+            not in_recompute
+            and GTP_CONFIG.weight_prefetch
+            and self.next_w is not None
+            and self.next_w._need_weight_prefetch
+        ):
+            # Pre-AG work on caller; NCCL wrap lives at the collective site
+            # inside _all_gather_weight. See all_gather_and_prefetch_bwd.
+            _, handle = self.next_w._all_gather_weight(
+                async_op=True,
+                skip_weight_cast=skip_weight_cast,
+                cast_noop_flag=cast_noop_flag,
+                fwd=fwd,
+                nvtx_label=nvtx_label,
+            )
+            self.next_w._prefetch_handle = handle
+
+        # Unsharded tensor returned, no pending work → reset state to NONE. Skip during recompute:
+        # a bwd-chain prefetch may hold an in-flight AG state this weight's later dgrad needs.
+        if GTP_CONFIG.check_param_states and not in_recompute:
+            for w in self._weights:
+                w._set_state(GTPWeightState.NONE)
+
+        cls = type(self)
+
+        # Lazy-build the recompute-forward prefetch chain (first backward, in recompute order).
+        # Consume/prefetch above used the prior iter's links, so the first backward runs on-demand
+        # while these are established.
+        if in_recompute and not self._recompute_initialized:
+            rchain = cls._get_recompute_chain_state(self.chain_id)
+            last_r = rchain["last_weight"]
+            if last_r is not None and last_r._recompute_next is None:
+                last_r._recompute_next = self
+                self._recompute_prev = last_r
+            self._recompute_initialized = True
+            rchain["last_weight"] = self
+
+        # Lazy population of the fwd/bwd linked list: link previous weight to current.
+        # Uses per-chain state so dense and expert chains never cross-link.
+        chain = cls._get_chain_state(self.chain_id)
+        if not self.prefetch_initialized:
+            last_w = chain["last_weight"]
+            if last_w is not None and last_w.next_w is None:
+                cls._buffer_link_table_row(last_w, self, chain)
+                last_w.next_w = self
+                self.prev_w = last_w
+
+            cache = get_global_GTP_cache()
+
+            # Set the fwd ag buffer
+            quantizers = [w._quantizer for w in self._weights]
+            dtypes = [
+                q.dtype if q is not None else w.dtype for q, w in zip(quantizers, self._weights)
+            ]
+            for w, dt in zip(self._weights, dtypes):
+                w._ag_ticket_fwd = cache.reserve(w, dt, fwd=True)
+                cache.get(w._ag_ticket_fwd)
+                cache.release(w._ag_ticket_fwd)
+
+            self.prefetch_initialized = True
+            chain["last_weight"] = self
+        elif not chain["link_table_flushed"] and chain["link_table_buffer"]:
+            # Second forward pass: flush the complete table atomically to avoid interleaving
+            chain["link_table_flushed"] = True
+            log_single_rank(logger, logging.INFO, "\n".join(chain["link_table_buffer"]) + "\n")
+
+        return result
+
+    def batched_all_gather_and_prefetch(self, **kwargs):
+        """Batched all-gather + prefetch for expert weights (wraps all_gather_and_prefetch)."""
+        assert self.is_routed_expert and self.weight_list is not None
+        return self.all_gather_and_prefetch(**kwargs)
+
+    def get_wgrad_tensor(self):
+        """Pool-allocate a wgrad scratch tensor of unsharded shape for the bwd GEMM."""
+        return _wgrad_pool_get(self._unsharded_shape, self.main_grad.dtype, self.device)
+
+    def register_grad_accum_hook(self, grad_accum_node, hook):
+        """Register a DDP backward hook to call after the wgrad RS finalize.
+
+        For GTP params autograd may receive None (async RS), so the normal grad-accumulator
+        hook never fires; the integrator (Graphed.backward for captured chains, or the eager
+        chain-tail cascade) calls this hook explicitly after RS wait + accumulation, so DDP's
+        register_grad_ready fires at the right time. grad_accum_node is accepted for API
+        compatibility but not retained — only the hook callable.
+        """
+        del grad_accum_node
+        self._grad_accum_hook = hook
+
+    @staticmethod
+    def _handle_megatron_grad_accum(param):
+        """Handle megatron DDP and gradient-accumulation fusion.
+
+        Do NOT set param.grad before calling the hook — the hook checks param.grad and would
+        accumulate it into main_grad if zero_out_wgrad is True, corrupting it with a dummy.
+        """
+        if hasattr(param, "grad_added_to_main_grad"):
+            param.grad_added_to_main_grad = True
+        dummy_grad = get_dummy_wgrad(list(param.main_grad.shape), param.dtype)
+        if getattr(param, "_grad_accum_hook", None) is not None:
+            param._grad_accum_hook()
+
+        param._set_rs_state(GTPWeightState.NONE)
+        return dummy_grad
+
+    def _wait_reduce_scatter(self, finalize_grad=False):
+        # Enter rs_stream context so handle.wait() + rs_event.record() land on rs_stream
+        # (mirrors _wait_param_gather). With finalize_grad=True, main_grad.add_ also runs on
+        # rs_stream right after the NCCL RS — starts during AG drain, not after, avoiding
+        # SM-saturation that blocks cross-graph overlap.
+        rs_stream = self._cached_rs_stream
+        if rs_stream is None:
+            rs_stream = get_rs_stream(self.chain_id, self.group)
+            self._cached_rs_stream = rs_stream
+        with torch.cuda.stream(rs_stream):
+            if self._wgrad_rs_handle is not None:
+                self._wgrad_rs_handle.wait()
+                self._wgrad_rs_handle = None
+                self.rs_event.record()
+                if finalize_grad:
+                    cache = get_global_GTP_cache()
+                    for w in self._weights:
+                        wgrad_rs = cache.get(w._rs_ticket)
+                        w.main_grad.add_(wgrad_rs)
+                        cache.release(w._rs_ticket)
+                    # Fire grad-ready AFTER all adds (separate loop so a bucket-completing
+                    # grad-ready can't dispatch the RS before a sibling's add). With autograd
+                    # grad-ready suppressed for GTP params (DDP register_grad_accum_hook), this
+                    # is the only grad-ready for a weight finalized here; else the bucket orphans.
+                    for w in self._weights:
+                        self._handle_megatron_grad_accum(w)
+                    self._already_finalized = True
+        # Release stashed wgrad inputs: UNGRAPHED buffers go back to the pool;
+        # GRAPHED just drops Python refs (addresses must stay stable for CG).
+        if getattr(self, "_wgrad_input_bufs", None) is not None:
+            if self.chain_id == GTPChain.UNGRAPHED.value:
+                for buf in self._wgrad_input_bufs:
+                    _wgrad_pool_put(buf)
+            self._wgrad_input_bufs = None
+
+    def _reduce_scatter(self, wgrads, async_op, nvtx_label=None):
+        """Reduce-scatter one or more wgrads → (outputs, handle). Single tensor: plain RS;
+        multiple: coalesced RS."""
+        if nvtx_label is None:
+            nvtx_label = self._debug_name + ".bwd" + (".async" if async_op else ".sync")
+
+        if GTP_CONFIG.check_param_states:
+            new_rs_state = GTPWeightState.ASYNC_WAIT if async_op else GTPWeightState.DATA_READY_SYNC
+            for w in self._weights:
+                w._set_rs_state(new_rs_state)
+
+        if self.pad_length > 0:
+            wgrads = [torch.nn.functional.pad(w, (0, 0, 0, self.pad_length)) for w in wgrads]
+
+        if async_op:
+            dtypes = [w.dtype for w in wgrads]
+            out_buffers = []
+            cache = get_global_GTP_cache()
+            for p, dt in zip(self._weights, dtypes):
+                if p._rs_ticket is None:
+                    p._rs_ticket = cache.reserve(p, dt, fwd=False, reduce_scatter=True)
+                out_buffers.append(cache.get(p._rs_ticket))
+        else:
+            out_buffers = [None] * len(wgrads)
+
+        # ASYNC RS: issue on rs_stream so its tail reflects the collective's full lifecycle
+        # (what external wait_stream(rs_stream) drains depend on). The explicit outer→rs_stream
+        # sync event preserves the wgrad-GEMM writer edge the bare stream context drops; held on
+        # self so the event pool can't recycle it between capture and replay. Mirrors the AG path.
+        # SYNC RS: stay on caller — output ready on return.
+        if async_op:
+            outer_stream = torch.cuda.current_stream()
+            rs_stream = get_rs_stream(self.chain_id, self.group)
+            if getattr(self, "_rs_outer_sync_event", None) is None:
+                self._rs_outer_sync_event = torch.cuda.Event()
+            outer_sync_event = self._rs_outer_sync_event
+            outer_sync_event.record(outer_stream)
+            rs_stream.wait_event(outer_sync_event)
+            rs_ctx = torch.cuda.stream(rs_stream)
+        else:
+            rs_ctx = nullcontext()
+
+        with rs_ctx:
+            if len(wgrads) == 1:
+                nvtx_range_push(f"{nvtx_label}.gtp_rs")
+                out, handle = reduce_scatter_along_first_dim(
+                    wgrads[0], self.group, async_op=async_op, output=out_buffers[0]
+                )
+                nvtx_range_pop(f"{nvtx_label}.gtp_rs")
+                return [out], handle
+
+            outputs = []
+            nvtx_range_push(f"{nvtx_label}.batched_gtp_rs")
+            with torch.distributed._coalescing_manager(
+                group=self.group, device=wgrads[0].device, async_ops=async_op
+            ) as cm:
+                for out_buffer, tensor in zip(out_buffers, wgrads):
+                    out, _ = reduce_scatter_along_first_dim(tensor, self.group, output=out_buffer)
+                    outputs.append(out)
+            nvtx_range_pop(f"{nvtx_label}.batched_gtp_rs")
+
+            return outputs, cm if async_op else None
+
+    def wgrad_reduce_scatter(self, wgrad, nvtx_label=None):
+        """Reduce-scatter wgrad(s): sync for the last weight, async+deferred for others.
+        Accepts a single tensor (non-routed) or a list (routed experts).
+
+        Returns:
+            Single tensor or list for sync (last weight) — backward returns this.
+            None or tuple of Nones for async — backward returns this.
+        """
+        batched = isinstance(wgrad, (list, tuple))
+        wgrads = list(wgrad) if batched else [wgrad]
+        weights = self._weights
+
+        # UNGRAPHED wgrads recycle via the standalone pool (_wgrad_pool_put); GRAPHED wgrads
+        # cannot, since CUDA graphs require stable buffer addresses across replay.
+        poolable = self.chain_id == GTPChain.UNGRAPHED.value
+
+        if GTP_CONFIG.async_reduction and self.prev_w is not None:
+            # Async RS (not last weight — deferred finish). Pre-RS work on caller; NCCL wrap
+            # lives at the collective site inside _reduce_scatter (mirrors the AG prefetch sites).
+            _, rs_handle = self._reduce_scatter(wgrads, async_op=True, nvtx_label=nvtx_label)
+            self._wgrad_rs_handle = GTPShardHandle(rs_handle, weights, reduce_scatter=True)
+            # Stash wgrad input buffers — cannot recycle yet because the async RS
+            # kernel is still reading them on rs_stream.
+            self._wgrad_input_bufs = wgrads
+            ret = tuple([None] * len(wgrads)) if batched else None
+        else:
+            # Sync reduce-scatter — reached as the natural chain-head case, recycle immediately
+            wgrads, _ = self._reduce_scatter(wgrads, async_op=False, nvtx_label=nvtx_label)
+            nvtx_range_push(f"{nvtx_label}.gtp_wgrad_accum")
+            if len(weights) == 1:
+                weights[0].main_grad.add_(wgrads[0])
+            else:
+                torch._foreach_add_([p.main_grad for p in weights], wgrads)
+            nvtx_range_pop(f"{nvtx_label}.gtp_wgrad_accum")
+            result = [self._handle_megatron_grad_accum(p) for p in weights]
+
+            if poolable:
+                for buf in wgrads:
+                    _wgrad_pool_put(buf)
+            ret = result if batched else result[0]
+
+        # Wait for last reduce scatter if it was async
+        # Currently only support reduce scattering in reverse order
+        if GTP_CONFIG.async_reduction and self.next_w is not None:
+            self.next_w._wait_reduce_scatter()
+
+            if getattr(self.next_w, "_already_finalized", False):
+                self.next_w._already_finalized = False
+            else:
+                self.next_w.rs_event.wait()
+                cache = get_global_GTP_cache()
+                next_weights = self.next_w._weights
+                wgrads = [cache.get(w._rs_ticket) for w in next_weights]
+                nvtx_range_push(f"{self.next_w._debug_name}.gtp_wgrad_accum_deferred")
+                # Only batch with _foreach_add_ when finalizing multiple (routed) weights.
+                if len(next_weights) == 1:
+                    next_weights[0].main_grad.add_(wgrads[0])
+                else:
+                    torch._foreach_add_([w.main_grad for w in next_weights], wgrads)
+                nvtx_range_pop(f"{self.next_w._debug_name}.gtp_wgrad_accum_deferred")
+                for w in next_weights:
+                    self._handle_megatron_grad_accum(w)
+                    cache.release(w._rs_ticket)
+
+        return ret
+
+    def batched_wgrad_reduce_scatter(self, wgrad_list, nvtx_label=None):
+        """Batched version of wgrad_reduce_scatter."""
+        assert self.is_routed_expert and self.weight_list is not None
+        return self.wgrad_reduce_scatter(wgrad_list, nvtx_label=nvtx_label)
+
+    def get_data_tensors(self):
+        """Expose self as the lone data tensor for TE's offload-marking interface.
+
+        TE's mark_activation_offload treats any non-plain tensor as a storage wrapper and calls
+        get_data_tensors() on it; a sharded param has no inner buffers, so it is its own.
+        """
+        return (self,)
+
+    def __torch_function__(self, func, types, args=(), kwargs=None):
+        """Subclass-preserving dispatch for ``detach`` (other ops fall through)."""
+        del types  # required by protocol, unused here
+        if kwargs is None:
+            kwargs = {}
+
+        if func is torch.Tensor.detach:
+            with torch._C.DisableTorchFunctionSubclass():
+                # Perform the raw detach
+                result = func(*args, **kwargs)
+            # Re-wrap it in your subclass so PyTorch is happy
+            return result.as_subclass(type(self))
+
+        # 2. For everything else (add, mul, etc.), be transparent/decay.
+        with torch._C.DisableTorchFunctionSubclass():
+            return func(*args, **kwargs)
+
+
+@dataclass
+class _TicketSlot:
+    """Internal slot backing a persistent ticket in the GTP buffer cache."""
+
+    key: tuple  # cache key (shape, dtype, ...)
+    param: "GTPShardedParam"  # for lazy allocation metadata
+    dtype: object  # torch.dtype or tex.DType
+    reduce_scatter: bool
+    fwd: bool
+    chain_id: str = GTPChain.GRAPHED.value  # chain this slot belongs to
+    buf: Optional[torch.Tensor] = field(default=None)  # None when released or after clear()
+
+
+# CUDA-graph memory pool: routes GRAPHED-chain allocations (AG/RS buffers, quantized weight
+# storage) into the capture pool at creation time, avoiding post-hoc reallocation. Registered
+# via set_cuda_graph_mempool before the first graphed forward; stays None when CG is off, where
+# _graphed_alloc is a no-op (regular allocator).
+_CG_MEMPOOL_DEVICE = None
+_CG_MEMPOOL = None
+
+
+def set_cuda_graph_mempool(device, mempool):
+    """Register the CUDA-graph memory pool for GRAPHED-chain GTP allocations."""
+    global _CG_MEMPOOL_DEVICE, _CG_MEMPOOL
+    _CG_MEMPOOL_DEVICE = device
+    _CG_MEMPOOL = mempool
+
+
+@contextmanager
+def _graphed_alloc(chain_id):
+    """Route allocations in this block into the registered CG mempool when ``chain_id``
+    is GRAPHED and a pool is registered; otherwise a no-op (regular allocator)."""
+    if _CG_MEMPOOL is not None and chain_id == GTPChain.GRAPHED.value:
+        torch._C._cuda_beginAllocateCurrentThreadToPool(_CG_MEMPOOL_DEVICE, _CG_MEMPOOL)
+        try:
+            yield
+        finally:
+            torch._C._cuda_endAllocateToPool(_CG_MEMPOOL_DEVICE, _CG_MEMPOOL)
+    else:
+        yield
+
+
+class GTPWeightCache:
+    """Ticket-based buffer pool for GTP all-gather / reduce-scatter buffers.
+
+    - reserve(param, dtype, fwd) → ticket: assign a persistent ticket (no buffer yet).
+    - get(ticket) → buffer: return the buffer, lazily (re)allocating from pool or fresh.
+    - release(ticket): return the buffer to the pool; ticket stays valid.
+    - clear(): drop all buffers/pools; tickets stay valid, next get() allocates fresh.
+    """
+
+    # Bytes per element for known dtypes (for logging). Add entries when GTP caches buffers of
+    # new quantized dtypes — only DType values the TE pybind bindings expose (verify via
+    # hasattr(tex.DType, ...) before adding speculative entries).
+    _BYTES_PER_ELEMENT = {
+        torch.bfloat16: 2,
+        torch.float16: 2,
+        torch.float32: 4,
+        tex.DType.kFloat4E2M1: 0.5,
+        tex.DType.kFloat8E4M3: 1,
+        tex.DType.kFloat8E5M2: 1,
+    }
+
+    def __init__(self):
+        self._pool: Dict[tuple, List[torch.Tensor]] = defaultdict(list)
+        self._slots: Dict[int, _TicketSlot] = {}
+        self._next_ticket: int = 0
+        self._total_bytes: int = 0  # running total of allocated bytes
+        self.key_to_allocate_func = {}
+
+    @staticmethod
+    def _buf_bytes(shape, dtype) -> int:
+        """Estimate buffer size in bytes."""
+        numel = 1
+        for d in shape:
+            numel *= d
+        if dtype not in GTPWeightCache._BYTES_PER_ELEMENT:
+            raise KeyError(
+                f"GTPWeightCache._buf_bytes: unknown dtype {dtype!r}.  "
+                "Add it to GTPWeightCache._BYTES_PER_ELEMENT with its bytes-per-element."
+            )
+        return int(numel * GTPWeightCache._BYTES_PER_ELEMENT[dtype])
+
+    def _allocate_buffer(
+        self, param: "GTPShardedParam", dtype, reduce_scatter, fwd
+    ) -> torch.Tensor:
+        if reduce_scatter:
+            out_shape = param._sharded_padded_shape
+        else:
+            out_shape = param._unsharded_shape_padded
+
+        # Route GRAPHED-chain buffers into the CG mempool at creation (see _graphed_alloc).
+        with _graphed_alloc(getattr(param, "chain_id", GTPChain.UNGRAPHED.value)):
+            if not isinstance(dtype, torch.dtype):
+                quantizer = param._quantizer
+                assert quantizer is not None
+                param._quantizer.set_usage(rowwise=fwd, columnwise=not fwd)
+
+                buf = param._quantizer.make_empty(
+                    out_shape, dtype=torch.bfloat16, device=torch.cuda.current_device()
+                )
+            else:
+                buf = torch.empty(
+                    out_shape,
+                    dtype=dtype,
+                    device=param.device,
+                    memory_format=torch.contiguous_format,
+                )
+
+        buf_bytes = self._buf_bytes(out_shape, dtype)
+        self._total_bytes += buf_bytes
+        dtype_str = (
+            str(dtype) if isinstance(dtype, torch.dtype) else getattr(dtype, "name", str(dtype))
+        )
+        log_single_rank(
+            logger,
+            logging.INFO,
+            f"[GTP Cache] +{buf_bytes / 1024**2:.1f} MB  (shape={out_shape}, dtype={dtype_str})  "
+            f"total={self._total_bytes / 1024**2:.1f} MB  param: {param._debug_name} fwd: {fwd}",
+        )
+        return buf
+
+    def reserve(self, param: "GTPShardedParam", dtype, fwd: bool, reduce_scatter=False) -> int:
+        """Assign a persistent ticket.  No buffer is allocated until ``get()``."""
+        key = param._get_cache_key(dtype, fwd, reduce_scatter)
+        ticket = self._next_ticket
+        self._next_ticket += 1
+
+        self._slots[ticket] = _TicketSlot(
+            key=key,
+            param=param,
+            dtype=dtype,
+            reduce_scatter=reduce_scatter,
+            fwd=fwd,
+            chain_id=getattr(param, "chain_id", GTPChain.UNGRAPHED.value),
+        )
+        return ticket
+
+    def get(self, ticket: int) -> torch.Tensor:
+        """Return the buffer for *ticket*, lazily allocating if needed."""
+        slot = self._slots[ticket]
+        if slot.buf is None:
+            pool = self._pool[slot.key]
+            slot.buf = (
+                pool.pop()
+                if pool
+                else self._allocate_buffer(
+                    slot.param, slot.dtype, slot.reduce_scatter, fwd=slot.fwd
+                )
+            )
+            self.key_to_allocate_func[slot.key] = (
+                slot.param,
+                slot.dtype,
+                slot.reduce_scatter,
+                slot.fwd,
+            )
+
+        return slot.buf
+
+    def release(self, ticket: int):
+        """Return the buffer to the pool (ticket stays valid).
+
+        slot.buf is intentionally NOT cleared: get() must stay idempotent so CUDA-graph-captured
+        buffers keep their fixed address across replays.
+        """
+        slot = self._slots[ticket]
+        if slot.buf is None:
+            return
+        # Use identity check — tensor == tensor returns a multi-element bool tensor
+        # which crashes in a boolean context ("Boolean value of Tensor is ambiguous").
+        if not any(b is slot.buf for b in self._pool.get(slot.key, [])):
+            self._pool[slot.key].append(slot.buf)
+
+    def clear(self):
+        """Drop all buffers; tickets remain valid and lazily re-allocate on next get()."""
+        for slot in self._slots.values():
+            slot.buf = None
+        self._pool.clear()
+        self._total_bytes = 0
+
+
+def get_global_GTP_cache() -> GTPWeightCache:
+    """Get or lazily create the global cache instance."""
+    global _GTP_CACHE
+    if _GTP_CACHE is None:
+        _GTP_CACHE = GTPWeightCache()
+    return _GTP_CACHE
+
+
+def wait_async_comms(
+    chain_id: str = None, skip_rs: bool = False, finalize_after_drain: bool = False
+):
+    """Drain in-flight GTP async AG / RS handles.
+
+    Inside CUDA graph capture the drains are captured into the graph — the producer-side hook
+    for cross-graph overlap. A captured cudaStreamWaitEvent on another capture session's event is
+    a CUDA no-op, so consumers can't wait cross-graph; instead the producer drains here and flags
+    the param, and the consumer skips its captured wait.
+
+    Args:
+        chain_id: If specified, only drain params on this chain.
+        skip_rs:  Drain AG only; leave RS in flight.
+        finalize_after_drain: After RS drain, also accumulate wgrad into
+                 main_grad. Runs main_grad.add_ on rs_stream (right after
+                 NCCL RS) so it starts during AG drain rather than after,
+                 avoiding SM-saturation that blocks cross-graph overlap.
+                 Falls back to caller-stream accumulation if no RS handle.
+
+    Per-param side effects:
+        * _already_ag_drained = True   (if an AG handle was drained)
+        * _already_finalized  = True   (if finalize_after_drain=True)
+    """
+    for param in list(_inflight_comm_params):
+        if (
+            chain_id is not None
+            and getattr(param, "chain_id", GTPChain.UNGRAPHED.value) != chain_id
+        ):
+            continue
+        had_ag = param._prefetch_handle is not None
+        param._wait_param_gather()
+        if had_ag:
+            param._already_ag_drained = True
+        # Recompute-forward chain: drain its separate in-flight rowwise AG so the
+        # captured recompute consumer skips its cross-graph wait (full-iteration CG).
+        if param._recompute_prefetch_handle is not None:
+            param._wait_recompute_param_gather()
+            param._recompute_already_drained = True
+        if not skip_rs:
+            param._wait_reduce_scatter(finalize_grad=finalize_after_drain)
+            # Fallback inline-accumulation: only when finalize is requested, _wait_reduce_scatter
+            # didn't already finalize, and an RS actually ran (rs_ticket set). Skips pure-AG
+            # prefetches in _inflight_comm_params (no wgrad).
+            need_fallback_accumulation = (
+                finalize_after_drain
+                and not getattr(param, "_already_finalized", False)
+                and any(w._rs_ticket is not None for w in param._weights)
+            )
+            if need_fallback_accumulation:
+                cache = get_global_GTP_cache()
+                param.rs_event.wait()
+                for w in param._weights:
+                    w._set_rs_state(GTPWeightState.NONE)
+                    wgrad_rs = cache.get(w._rs_ticket)
+                    w.main_grad.add_(wgrad_rs)
+                    cache.release(w._rs_ticket)
+                    if hasattr(w, "grad_added_to_main_grad"):
+                        w.grad_added_to_main_grad = True
+                param._already_finalized = True
+
+
+@dataclass
+class BatchedNVFP4AllGatherAsyncHandle:
+    """Handle for batched asynchronous NVFP4 all-gathers."""
+
+    output_handles: List[_NVFP4AllGatherAsyncHandle]
+    outer_async_handle: torch.distributed.Work
+    _synchronized: bool = False
+
+    def wait(self) -> None:
+        """Wait for the async operation to complete and post-process the tensor."""
+        if self._synchronized:
+            return
+        self.outer_async_handle.wait()
+        # Fixes interleaved data for transposed tensor/scale inv and pads scale inv if needed.
+        for output_handle in self.output_handles:
+            if output_handle is not None:
+                assert output_handle.async_handle is None
+                output_handle.post_process_nvfp4_gather()
+                # release any tensor references just in case
+                output_handle.output = None
+                output_handle.columnwise_data_interleaved = None
+                output_handle.columnwise_scale_inv_interleaved = None
+
+        self._synchronized = True
+
+
+def grouped_gather_along_first_dim(
+    weights: list,
+    process_group,
+    async_op: bool = False,
+    quantizers: list = None,
+    output_tensors: list = None,
+):
+    """All-gather multiple weights in one coalesced op; handles NVFP4 post-processing for both
+    sync and async paths."""
+    # Determine device from first weight.
+    inp = weights[0]
+    if isinstance(inp, NVFP4TensorStorage):
+        device = (
+            inp._rowwise_data.device
+            if inp._rowwise_data is not None
+            else inp._columnwise_data.device
+        )
+    else:
+        device = inp.device
+
+    weights_all = []
+    weight_handles = []
+    with torch.distributed._coalescing_manager(
+        group=process_group, device=device, async_ops=async_op
+    ) as gather_coalescing_manager:
+        for i, weight in enumerate(weights):
+            weight_all, weight_handle = gather_along_first_dim(
+                weight,
+                process_group,
+                quantizer=quantizers[i],
+                output_tensor=output_tensors[i] if output_tensors is not None else None,
+                grouped=True,
+            )
+            weights_all.append(weight_all)
+            weight_handles.append(weight_handle)
+
+    if async_op:
+        handle = gather_coalescing_manager
+        has_nvfp4_handles = any(isinstance(wh, _NVFP4AllGatherAsyncHandle) for wh in weight_handles)
+        if has_nvfp4_handles:
+            handle = BatchedNVFP4AllGatherAsyncHandle(weight_handles, handle)
+    else:
+        for wh in weight_handles:
+            if isinstance(wh, _NVFP4AllGatherAsyncHandle):
+                wh.post_process_nvfp4_gather()
+        handle = None
+
+    return weights_all, handle
+
+
+class GTPEmbeddingWeight(torch.autograd.Function):
+    """All-gather the embedding weight across the GTP group in forward, reduce-scatter its
+    gradient in backward.
+
+    The weight is stored sharded along the vocab dimension; this materializes the full weight
+    for the lookup and distributes the gradient back to the shard.
+    """
+
+    @staticmethod
+    def forward(ctx, weight):
+        """All-gather the full embedding weight across the GTP group for the lookup."""
+        ctx.save_for_backward(weight)
+        return weight.all_gather_and_prefetch(fwd=True)
+
+    @staticmethod
+    def backward(ctx, grad_output):
+        """Reduce-scatter the gradient back to this rank's vocab-dim shard."""
+        (weight,) = ctx.saved_tensors
+        return weight.wgrad_reduce_scatter(grad_output)
+
+
+def reset_gtp_state():
+    """Clear the process-global GTP prefetch-chain state (GTPShardedParam._chain_state /
+    ._recompute_chain_state).
+
+    These class-level dicts survive model teardown, so a GTP model rebuilt in-process would
+    inherit stale last_weight pointers / flushed link tables. Call once before the per-chunk
+    classify_gtp_chains loop (never inside it — chains span chunks). No-op on a fresh process.
+    """
+    GTPShardedParam._chain_state.clear()
+    GTPShardedParam._recompute_chain_state.clear()
+
+
+def reset_gtp_quantize_cache(model):
+    """Invalidate the per-shard low-precision cache after a checkpoint load.
+
+    DCP load copies new data into GTPShardedParam.data in-place, leaving a stale FP8/MXFP8/NVFP4
+    buffer in self.quantized. Call once after load so the next forward re-quantizes from the
+    freshly-loaded weight.
+    """
+    for param in model.parameters():
+        if isinstance(param, GTPShardedParam):
+            param.did_cast_to_low_precision = False
+
+
+# ------------------------------------------------------------------------
+# Distributed-checkpointing helpers
+# ------------------------------------------------------------------------
+# GTP shards axis 0 on top of TP, but the vanilla utils helpers only know TP, so their offsets
+# miss the GTP slice. The helper below detects GTPShardedParam per-tensor and composes TP × GTP
+# into one axis-0 offset (or two offsets), with replica_id = the DP-with-GTP-with-CP rank.
+
+
+def make_sharded_tensors_for_checkpoint_with_gtp(
+    state_dict,
+    prefix,
+    tensor_parallel_layers_axis_map=None,
+    sharded_offsets=(),
+    extra_state_suffix="_extra_state",
+    *,
+    tp_group,
+    dp_cp_group,
+    intra_dp_cp_no_gtp_group=None,
+):
+    """GTP-aware analogue of make_sharded_tensors_for_checkpoint.
+
+    Detects GTP sharding per-tensor (isinstance(tensor, GTPShardedParam)). Non-GTP tensors keep
+    the vanilla offsets exactly; GTP tensors layer the GTP axis-0 split on top. No-op (delegates
+    to the vanilla helper) when no tensor is a GTPShardedParam, so this is zero-cost when GTP is
+    inactive.
+    """
+    from megatron.core.transformer.utils import (  # noqa: E402
+        make_sharded_object_for_checkpoint,
+        make_sharded_tensors_for_checkpoint,
+    )
+    from megatron.core.utils import (  # noqa: E402
+        get_pg_rank,
+        get_pg_size,
+        make_sharded_tensor_for_checkpoint,
+        make_tp_sharded_tensor_for_checkpoint,
+    )
+
+    # Fast path: no GTP-sharded params → defer to vanilla helper, same output.
+    if not any(isinstance(t, GTPShardedParam) for t in state_dict.values()):
+        return make_sharded_tensors_for_checkpoint(
+            state_dict,
+            prefix,
+            tensor_parallel_layers_axis_map,
+            sharded_offsets,
+            extra_state_suffix=extra_state_suffix,
+            tp_group=tp_group,
+            dp_cp_group=dp_cp_group,
+        )
+
+    if tensor_parallel_layers_axis_map is None:
+        tensor_parallel_layers_axis_map = {}
+
+    tp_rank = get_pg_rank(tp_group)
+    tp_size = get_pg_size(tp_group)
+    # All GTP params in this state_dict share the same gtp_group (set by the
+    # wrap hook at module init), so pick it off the first GTP shard.
+    gtp_group = next(t.group for t in state_dict.values() if isinstance(t, GTPShardedParam))
+    gtp_rank = get_pg_rank(gtp_group)
+    gtp_size = get_pg_size(gtp_group)
+
+    # DP-with-GTP-with-CP rank — replicas of a given GTP chunk live here.
+    if intra_dp_cp_no_gtp_group is not None:
+        dp_no_gtp_rank = get_pg_rank(intra_dp_cp_no_gtp_group)
+    else:
+        from megatron.core import parallel_state  # noqa: E402
+
+        dp_no_gtp_rank = parallel_state.get_data_parallel_rank(
+            with_context_parallel=True, no_gtp=True
+        )
+
+    sharded_state_dict = {}
+    for layer_name, tensor in state_dict.items():
+        layer_key = f"{prefix}{layer_name}"
+        is_gtp = isinstance(tensor, GTPShardedParam)
+
+        if layer_name.endswith(extra_state_suffix):
+            # ShardedObject (extra_state metadata): GTP-REPLICATED across the GTP group. Fold
+            # gtp_rank into position 1 of the replica_id (PP, TP-replica-coord, DP) tuple so
+            # GTP-peer ranks within the same TP slice get unique replica_ids.
+            replica_id = (0, tp_rank * gtp_size + gtp_rank, dp_no_gtp_rank)
+            sharded_state_dict[layer_key] = make_sharded_object_for_checkpoint(
+                tensor, layer_key, sharded_offsets, replica_id=replica_id
+            )
+            continue
+
+        if not is_gtp:
+            # Non-GTPShardedParam under a GTP-active module (e.g. bias): GTP-replicated, so GTP
+            # ranks would collide on the same replica_id. Inject gtp_rank into replica_id
+            # position 1 (same as the GTP-sharded branch below).
+            if layer_name in tensor_parallel_layers_axis_map:
+                replica_id = (0, gtp_rank, dp_no_gtp_rank)
+                sharded_state_dict[layer_key] = make_tp_sharded_tensor_for_checkpoint(
+                    tensor,
+                    layer_key,
+                    tp_axis=tensor_parallel_layers_axis_map[layer_name],
+                    replica_id=replica_id,
+                    prepend_offsets=sharded_offsets,
+                    tp_group=tp_group,
+                    dp_cp_group=dp_cp_group,
+                )
+            else:
+                replica_id = (0, tp_rank * gtp_size + gtp_rank, dp_no_gtp_rank)
+                sharded_state_dict[layer_key] = make_sharded_tensor_for_checkpoint(
+                    tensor,
+                    layer_key,
+                    replica_id=replica_id,
+                    prepend_offsets=sharded_offsets,
+                    tp_group=tp_group,
+                    dp_cp_group=dp_cp_group,
+                )
+            continue
+
+        # GTP-sharded tensor: delegate to the GTP-aware single-tensor helper — it layers the
+        # axis-0 GTP split onto TP, elects the writer over the gtp-excluded DP group, and sets
+        # allow_shape_mismatch for alignment padding. (tp_axis None → 0; tp_size 1 when no TP.)
+        tp_axis = tensor_parallel_layers_axis_map.get(layer_name, None)
+        sharded_state_dict[layer_key] = make_tp_sharded_tensor_for_checkpoint(
+            tensor,
+            layer_key,
+            tp_axis=tp_axis if tp_axis is not None else 0,
+            prepend_offsets=sharded_offsets,
+            tp_group=tp_group,
+            dp_cp_group=dp_cp_group,
+        )
+
+    return sharded_state_dict
+
+
+# Wire GTP into TE's hook registry at import time, so any later
+# ``te.Linear(gtp_group=...)`` routes through the hooks below. If TE is too old to
+# expose ``register_gtp_hooks``, GTP silently no-ops (the warning surfaces that).
+try:
+    from transformer_engine.pytorch.module.base import (  # noqa: E402
+        register_gtp_hooks as _te_register_gtp_hooks,
+    )
+
+    _te_register_gtp_hooks(
+        slice_fn=gtp_slice_in_reset_parameters,
+        finalize_fn=gtp_finalize_module_in_reset_parameters,
+        wrap_fn=wrap_module_params_gtp,
+    )
+except ImportError:
+    warnings.warn(
+        "megatron.core.tensor_parallel.gtp: TransformerEngine does not expose register_gtp_hooks; "
+        "GTP will be a no-op for te.Linear / te.LayerNormLinear / te.GroupedLinear. "
+        "GTP requires TransformerEngine >= 2.17 (planned release). "
+        "Upgrade TransformerEngine to a build that includes the GTP hook registry.",
+        RuntimeWarning,
+        stacklevel=2,
+    )
diff --git a/megatron/core/tensor_parallel/gtp.py b/megatron/core/tensor_parallel/gtp.py
new file mode 100644
index 00000000000..fa6db4eeee7
--- /dev/null
+++ b/megatron/core/tensor_parallel/gtp.py
@@ -0,0 +1,64 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Generalized Tensor Parallelism (GTP) public API.
+
+GTP shards weight tensors 1/N across a GTP process group along ``out_features``
+and materializes them on-demand via async all-gather. The implementation lives
+in ``megatron.core.tensor_parallel.generalized_tensor_parallelism`` and depends
+on TransformerEngine's FP8 / MXFP8 / NVFP4 primitives.
+
+If TransformerEngine is missing or too old, the inner module imports cleanly
+but stubs its TE-backed symbols and reports ``HAVE_TE = False``; this module
+mirrors that as ``HAVE_GTP = False``. Consumers gate every GTP code path behind
+``if HAVE_GTP:``, so no core module uses GTP symbols without TE.
+"""
+
+try:
+    from megatron.core.tensor_parallel.generalized_tensor_parallelism import (
+        GTP_CONFIG,
+        HAVE_TE,
+        GTPChain,
+        GTPEmbeddingWeight,
+        GTPShardedParam,
+        classify_gtp_chains,
+        get_ag_stream,
+        get_rs_stream,
+        make_sharded_tensors_for_checkpoint_with_gtp,
+        reset_gtp_quantize_cache,
+        reset_gtp_state,
+        set_cuda_graph_mempool,
+        set_cuda_graph_modules,
+        tag_gtp_params_with_names,
+        update_gtp_config,
+        wait_async_comms,
+        wait_for_gtp_grad_reduction_on_current_stream,
+        wrap_module_params_gtp,
+    )
+
+    HAVE_GTP = HAVE_TE
+except ImportError:
+    # Defensive fallback for any unexpected inner-import failure; consumers import
+    # the other symbols lazily under an ``if HAVE_GTP:`` guard, so no stubs needed.
+    HAVE_GTP = False
+
+
+__all__ = [
+    "HAVE_GTP",
+    "GTP_CONFIG",
+    "GTPChain",
+    "GTPEmbeddingWeight",
+    "GTPShardedParam",
+    "classify_gtp_chains",
+    "get_ag_stream",
+    "get_rs_stream",
+    "make_sharded_tensors_for_checkpoint_with_gtp",
+    "reset_gtp_quantize_cache",
+    "reset_gtp_state",
+    "set_cuda_graph_mempool",
+    "set_cuda_graph_modules",
+    "tag_gtp_params_with_names",
+    "update_gtp_config",
+    "wait_async_comms",
+    "wait_for_gtp_grad_reduction_on_current_stream",
+    "wrap_module_params_gtp",
+]
diff --git a/megatron/core/tensor_parallel/inference_layers.py b/megatron/core/tensor_parallel/inference_layers.py
index 2adefc58634..8da5c3fbeb6 100644
--- a/megatron/core/tensor_parallel/inference_layers.py
+++ b/megatron/core/tensor_parallel/inference_layers.py
@@ -83,6 +83,7 @@ def __init__(
         is_expert: bool = False,
         symmetric_ar_type: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
     ):
         assert HAVE_TE, "--transformer-impl=inference_optimized requires transformer engine"
@@ -131,6 +132,7 @@ def __init__(
         skip_weight_param_allocation: bool = False,
         tp_comm_buffer_name: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
     ):
         assert HAVE_TE, "--transformer-impl=inference_optimized requires transformer engine"
@@ -260,6 +262,7 @@ def __init__(
         skip_weight_param_allocation: bool = False,
         tp_comm_buffer_name: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
     ):
         assert HAVE_TE, "--transformer-impl=inference_optimized requires transformer engine"
@@ -358,6 +361,7 @@ def __init__(
         is_expert: bool,
         tp_comm_buffer_name: Optional[str] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
     ):
         assert HAVE_TE, "--transformer-impl=inference_optimized requires transformer engine"
diff --git a/megatron/core/tensor_parallel/layers.py b/megatron/core/tensor_parallel/layers.py
index c072c52bd05..70d62e0610d 100644
--- a/megatron/core/tensor_parallel/layers.py
+++ b/megatron/core/tensor_parallel/layers.py
@@ -16,7 +16,9 @@
 
 from megatron.core.model_parallel_config import ModelParallelConfig
 from megatron.core.parallel_state import (
+    get_expert_gtp_weight_remat_rank,
     get_global_memory_buffer,
+    get_gtp_weight_remat_rank,
     get_tensor_model_parallel_rank,
     get_tensor_model_parallel_world_size,
 )
@@ -101,6 +103,29 @@ def param_is_not_tensor_parallel_duplicate(param, tp_group=None):
     return get_tensor_model_parallel_rank() == 0
 
 
+def copy_gtp_attributes(destination, source):
+    """Copy the GTP dedup tags (is_gtp, allreduce) onto a param view/copy, so the optimizer's
+    master shards stay classifiable by param_is_not_gtp_duplicate."""
+    for attr in ("is_gtp", "allreduce"):
+        if hasattr(source, attr):
+            setattr(destination, attr, getattr(source, attr))
+
+
+def param_is_not_gtp_duplicate(param):
+    """Returns true if the param's grad should be counted once across the GTP/EGTP axis.
+
+    GTP/EGTP shards are unique per peer (always kept); replicated params are counted only on
+    rank 0 of the gtp/egtp axis (else counted gtp/egtp times). When GTP is off the rank is 0,
+    so every param is kept.
+    """
+    if getattr(param, "is_gtp", False):
+        return True
+    is_expert = not getattr(param, "allreduce", True)
+    if is_expert:
+        return get_expert_gtp_weight_remat_rank() == 0
+    return get_gtp_weight_remat_rank() == 0
+
+
 def set_tensor_model_parallel_attributes(tensor, is_parallel, dim, stride):
     """Sets tp attributes to tensor"""
     # Make sure the attributes are not set.
@@ -219,6 +244,7 @@ def __init__(
         reduce_scatter_embeddings: bool = False,
         config: ModelParallelConfig,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         super(VocabParallelEmbedding, self).__init__()
         # Keep the input dimensions.
@@ -281,6 +307,17 @@ def __init__(
                     tensor=self.weight, is_parallel=True, dim=0, stride=1
                 )
 
+        self.gtp_size = 1
+        if gtp_group is not None and gtp_group.size() > 1:
+            from megatron.core.tensor_parallel.gtp import wrap_module_params_gtp
+
+            wrap_module_params_gtp(self, ["weight"], gtp_group)
+            self.gtp_size = gtp_group.size()
+            # Nothing prefetches embedding — it is head of the UNGRAPHED
+            # chain in fwd, and its bwd bypasses all_gather_and_prefetch_bwd
+            # via GTPEmbeddingWeight.backward.
+            self.weight._need_weight_prefetch = False
+
     def forward(self, input_):
         """Forward.
 
@@ -295,12 +332,19 @@ def forward(self, input_):
             masked_input[input_mask] = 0
         else:
             masked_input = input_
+
+        weight = self.weight
+        if self.gtp_size > 1:
+            from megatron.core.tensor_parallel.gtp import GTPEmbeddingWeight
+
+            weight = GTPEmbeddingWeight.apply(self.weight)
+
         # Get the embeddings.
         if self.deterministic_mode:
-            output_parallel = self.weight[masked_input]
+            output_parallel = weight[masked_input]
         else:
             # F.embedding currently has a non-deterministic backward function
-            output_parallel = F.embedding(masked_input, self.weight)
+            output_parallel = F.embedding(masked_input, weight)
         # Mask the output embedding.
         if self.tp_group.size() > 1:
             output_parallel[input_mask, :] = 0.0
@@ -400,6 +444,7 @@ def linear_with_frozen_weight(
     tp_group: Optional[torch.distributed.ProcessGroup],
     grad_output_buffer: Optional[List[torch.Tensor]] = None,
     wgrad_deferral_limit: None = None,
+    gtp_size: int = 1,
 ) -> torch.Tensor:
     """Linear layer execution with weight.requires_grad == False.
 
@@ -436,6 +481,10 @@ def linear_with_frozen_weight(
 
     wgrad_deferral_limit (int optional): dummy argument, used to
     keep the API unified between all forward implementation functions.
+
+    gtp_size (int): GTP shard count. When > 1 the weight is GTP-sharded and must be
+    all-gathered to its full shape before the matmul, mirroring the trainable path.
+    Defaults to 1 (no-op) for the common non-GTP / non-sharded case.
     """
 
     assert grad_output_buffer is None, (
@@ -456,6 +505,9 @@ def linear_with_frozen_weight(
     else:
         input = input
 
+    if gtp_size > 1:
+        weight = weight.all_gather_and_prefetch(fwd=True)
+
     args = [input, weight, bias, allreduce_dgrad, tp_group]
 
     return LinearWithFrozenWeight.apply(*args)
@@ -477,6 +529,7 @@ def forward(
         grad_output_buffer,
         wgrad_deferral_limit,
         tp_group,
+        gtp_size,
     ):
         """Forward."""
         if gradient_accumulation_fusion and hasattr(weight, "main_grad"):
@@ -484,6 +537,10 @@ def forward(
         else:
             main_grad = None
         ctx.save_for_backward(input, weight)
+
+        if gtp_size > 1:
+            weight = weight.all_gather_and_prefetch(fwd=True)
+
         # We can't save main_grad in save_for_backward as this module would be
         # reused across layers like MTP logits. So, to prevent in-place modification
         # checks we save the tensor in ctx.
@@ -495,6 +552,7 @@ def forward(
         ctx.wgrad_deferral_limit = wgrad_deferral_limit
         ctx.grad_output_buffer = grad_output_buffer
         ctx.tp_group = tp_group
+        ctx.gtp_size = gtp_size
 
         if sequence_parallel:
             dim_size = list(input.size())
@@ -518,6 +576,13 @@ def backward(ctx, grad_output):
         input, weight = ctx.saved_tensors
         main_grad = ctx.main_grad
         use_bias = ctx.use_bias
+
+        # GTP: re-gather weight for dgrad
+        if ctx.gtp_size > 1:
+            sharded_weight = weight
+            weight = sharded_weight.all_gather_and_prefetch_bwd()
+            ctx.gradient_accumulation_fusion = False
+
         grad_output_buffer = ctx.grad_output_buffer
         wgrad_deferral_limit = ctx.wgrad_deferral_limit
         handle = None
@@ -651,16 +716,31 @@ def backward(ctx, grad_output):
             grad_weight = grad_output.t().matmul(total_input)
         grad_bias = grad_output.sum(dim=0) if use_bias else None
 
+        # GTP: reduce-scatter wgrad
+        if ctx.gtp_size > 1 and grad_weight is not None:
+            grad_weight = sharded_weight.wgrad_reduce_scatter(grad_weight)
+
         if ctx.sequence_parallel:
             handle.wait()
             # Need to return None's as gradient has to flow for all the input arguments
             # provided during forward
-            return (sub_grad_input, grad_weight, grad_bias, None, None, None, None, None, None)
+            return (
+                sub_grad_input,
+                grad_weight,
+                grad_bias,
+                None,
+                None,
+                None,
+                None,
+                None,
+                None,
+                None,
+            )
 
         if ctx.allreduce_dgrad:
             handle.wait()
 
-        return grad_input, grad_weight, grad_bias, None, None, None, None, None, None
+        return grad_input, grad_weight, grad_bias, None, None, None, None, None, None, None
 
 
 def linear_with_grad_accumulation_and_async_allreduce(
@@ -673,6 +753,7 @@ def linear_with_grad_accumulation_and_async_allreduce(
     grad_output_buffer: Optional[List[torch.Tensor]] = None,
     wgrad_deferral_limit: Optional[int] = 0,
     tp_group: Optional[torch.distributed.ProcessGroup] = None,
+    gtp_size: int = 1,
 ) -> torch.Tensor:
     """Linear layer execution with asynchronous communication and
     gradient accumulation fusion in backprop.
@@ -749,6 +830,7 @@ def linear_with_grad_accumulation_and_async_allreduce(
         grad_output_buffer,
         wgrad_deferral_limit,
         tp_group,
+        gtp_size,
     ]
 
     if not linear_with_grad_accumulation_and_async_allreduce.warned:
@@ -844,6 +926,7 @@ def __init__(
         disable_grad_reduce: bool = False,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         super(ColumnParallelLinear, self).__init__()
 
@@ -923,6 +1006,13 @@ def __init__(
         else:
             self.weight = None
 
+        self.gtp_size = 1
+        if gtp_group is not None and gtp_group.size() > 1:
+            from megatron.core.tensor_parallel.gtp import wrap_module_params_gtp
+
+            wrap_module_params_gtp(self, ["weight"], gtp_group)
+            self.gtp_size = gtp_group.size()
+
         if bias:
             if config.use_cpu_initialization:
                 self.bias = Parameter(
@@ -1075,6 +1165,7 @@ def forward(
                 else None
             ),
             tp_group=self.tp_group,
+            gtp_size=self.gtp_size,
         )
 
         gather_output = self.gather_output
@@ -1191,6 +1282,7 @@ def __init__(
         tp_comm_buffer_name: str | None = None,  # Not used
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         super(RowParallelLinear, self).__init__()
 
@@ -1271,6 +1363,13 @@ def __init__(
                 )
         setattr(self.weight, "allreduce", not (self.is_expert and self.expert_parallel))
 
+        self.gtp_size = 1
+        if gtp_group is not None and gtp_group.size() > 1:
+            from megatron.core.tensor_parallel.gtp import wrap_module_params_gtp
+
+            wrap_module_params_gtp(self, ["weight"], gtp_group)
+            self.gtp_size = gtp_group.size()
+
         if bias:
             if config.use_cpu_initialization:
                 self.bias = Parameter(torch.empty(self.output_size, dtype=config.params_dtype))
@@ -1343,6 +1442,7 @@ def forward(self, input_: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
             sequence_parallel=False,
             tp_group=None,
             grad_output_buffer=None,
+            gtp_size=self.gtp_size,
         )
 
         # All-reduce across all the partitions.
diff --git a/megatron/core/transformer/attention.py b/megatron/core/transformer/attention.py
index 8fad62c60c5..4a91b8670a0 100644
--- a/megatron/core/transformer/attention.py
+++ b/megatron/core/transformer/attention.py
@@ -409,6 +409,7 @@ def __init__(
             tp_comm_buffer_name='proj',
             tp_group=self.pg_collection.tp,
             name=(name + ".linear_proj") if name is not None else None,
+            gtp_group=self.pg_collection.gtp,
         )
 
         if (
@@ -1643,6 +1644,7 @@ def __init__(
             tp_comm_buffer_name='qkv',
             tp_group=self.pg_collection.tp,
             name=(name + ".linear_qkv") if name is not None else None,
+            gtp_group=self.pg_collection.gtp,
         )
 
         # Resolve which norm class to use for Q and K.
diff --git a/megatron/core/transformer/cuda_graphs.py b/megatron/core/transformer/cuda_graphs.py
index 0de90c9cde4..7ed5dbda1b0 100644
--- a/megatron/core/transformer/cuda_graphs.py
+++ b/megatron/core/transformer/cuda_graphs.py
@@ -20,6 +20,7 @@
 import torch
 from torch.utils._pytree import tree_map as tree_map_pyt
 
+from megatron.core import parallel_state
 from megatron.core.num_microbatches_calculator import get_num_microbatches
 from megatron.core.process_groups_config import ProcessGroupCollection
 from megatron.core.tensor_parallel.random import (
@@ -57,6 +58,33 @@
 except:
     HAVE_TE_GRAPHS = False
 
+try:
+    from megatron.core.tensor_parallel.gtp import HAVE_GTP
+except ImportError:
+    # GTP requires TransformerEngine with the GTP hook registry; treat it as
+    # unavailable when that import path cannot be resolved.
+    HAVE_GTP = False
+
+if HAVE_GTP:
+    from megatron.core.tensor_parallel.gtp import (
+        GTP_CONFIG,
+        GTPChain,
+        get_ag_stream,
+        get_rs_stream,
+        set_cuda_graph_mempool,
+        wait_async_comms,
+    )
+else:
+    # Placeholders so static analysis does not flag these GTP-only symbols as
+    # possibly-used-before-assignment; every use site is guarded by HAVE_GTP /
+    # gtp_remat at runtime.
+    GTP_CONFIG = None
+    GTPChain = None
+    get_ag_stream = None
+    get_rs_stream = None
+    set_cuda_graph_mempool = None
+    wait_async_comms = None
+
 try:
     from tqdm import tqdm
 
@@ -69,6 +97,16 @@
 logger = logging.getLogger(__name__)
 
 
+_GTP_RUNNER_STREAMS: List[torch.cuda.Stream] = []
+
+
+def get_gtp_runner_streams() -> List[torch.cuda.Stream]:
+    """Replay streams of all GTP CG runners; finalize_model_grads waits on these
+    (tail = captured Phase 2 main_grad.add_) before reading main_grad.
+    """
+    return _GTP_RUNNER_STREAMS
+
+
 def _set_skip_fp8_weight_update_tensor(skip: bool) -> None:
     """Toggle TE's FP8 "skip weight refresh" flag between microbatches.
 
@@ -342,6 +380,36 @@ def _ensure_generator_state_is_cudagraph_safe(gen: torch.Generator) -> torch.Gen
 bwd_buffer_reuse_ref_count = 0
 
 
+def _backup_grads_before_capture(runner):
+    """Snapshot main_grad so create_fwd_graph's eager warmup can't corrupt the finalized grads;
+    restore with ``_restore_grads_after_capture``.
+    """
+    backup = {}
+    for p in runner.base_module.parameters():
+        mg = getattr(p, "main_grad", None)
+        if mg is not None:
+            backup[id(p)] = (p, mg.clone())
+
+    if runner.gtp_remat:
+        # GTP only: also protect the cross-graph next_w the cascade accumulates into.
+        for p in runner.base_module.parameters():
+            nw = getattr(p, "next_w", None) if getattr(p, "is_gtp", False) else None
+            if nw is None:
+                continue
+            shards = nw.weight_list if getattr(nw, "is_routed_expert", False) else [nw]
+            for w in shards or []:
+                mg = getattr(w, "main_grad", None)
+                if mg is not None and id(w) not in backup:
+                    backup[id(w)] = (w, mg.clone())
+    return backup
+
+
+def _restore_grads_after_capture(backup):
+    """Restore the main_grad snapshots taken by ``_backup_grads_before_capture``."""
+    for p, saved in backup.values():
+        p.main_grad.copy_(saved)
+
+
 class _CudagraphGlobalRecord:
     """A global datastructure that records of the ordering of all _CudaGraphRunner's
     first fwd or bwd passes. 'create_cudagraphs' will use this to create
@@ -416,6 +484,11 @@ def create_cudagraphs(cls):
                     "https://github.com/NVIDIA/TransformerEngine/blob/v2.10/transformer_engine/pytorch/utils.py#L759"  # pylint: disable=line-too-long
                 )
 
+        gtp_active = any(r[0].gtp_remat for r in cls.cudagraph_record)
+        if gtp_active:
+            # GTP buffer reuse during capture trips the param-state debug asserts; disable them.
+            GTP_CONFIG.check_param_states = False
+
         gc.collect()
         torch.cuda.empty_cache()
 
@@ -535,6 +608,7 @@ def delete_cuda_graphs():
     _CudagraphGlobalRecord.cudagraph_created = False
     _CudagraphGlobalRecord.cudagraph_record = []
     _CudagraphGlobalRecord.cudagraph_inference_record = []
+    _GTP_RUNNER_STREAMS.clear()
 
     # TODO: Optional?: Force garbage collection to clean up memory
     gc.collect()
@@ -636,7 +710,14 @@ def forward(ctx, runner, is_first_microbatch, *inputs):
                 _set_skip_fp8_weight_update_tensor(not is_first_microbatch)
                 runner.fp8_param_cache_updated = is_first_microbatch
 
-        runner.fwd_graph.replay()
+        if runner.use_stream:
+            runner.stream.wait_stream(torch.cuda.current_stream())
+            with torch.cuda.stream(runner.stream):
+                runner.fwd_graph.replay()
+            torch.cuda.current_stream().wait_event(runner.fwd_completion_event)
+        else:
+            runner.fwd_graph.replay()
+
         return runner.fwd_graph_output_surface
 
     @staticmethod
@@ -669,7 +750,14 @@ def backward(ctx, *grads):
             if user_output_grad.data_ptr() != cudagraph_output_grad.data_ptr():
                 cudagraph_output_grad.copy_(user_output_grad)
 
-        runner.bwd_graph.replay()
+        if runner.use_stream:
+            runner.stream.wait_stream(torch.cuda.current_stream())
+            with torch.cuda.stream(runner.stream):
+                runner.bwd_graph.replay()
+            torch.cuda.current_stream().wait_event(runner.bwd_completion_event)
+        else:
+            runner.bwd_graph.replay()
+
         runner.status = _GraphStatus.FWD_READY
 
         # Update FP8 scale factors if needed
@@ -684,10 +772,24 @@ def backward(ctx, *grads):
         for param, grad_added in runner.groundtruth_grad_added_to_main_grad.items():
             param.grad_added_to_main_grad = grad_added
 
+        # DDP grad-ready hook is silenced at capture/replay, so fire it here (on each param's
+        # rs_stream, after wait_stream(runner.stream) fences Phase 2) to let DDP RS overlap bwd.
+        if runner.gtp_remat:
+            for gtp_rs_stream, params in runner._gtp_finalize_hook_plan:
+                gtp_rs_stream.wait_stream(runner.stream)
+                with torch.cuda.stream(gtp_rs_stream):
+                    for param in params:
+                        hook = getattr(param, '_grad_accum_hook', None)
+                        if hook is not None:
+                            hook()
+
         # Replaying the next bwd graph destroys the data held in static_grad_inputs, so clone
         # wgrads as autograd may launch the next graph before wgrads are accumulated
         dgrads = runner.static_grad_inputs[: runner.num_dgrads]
-        wgrads = (g.clone() for g in runner.static_grad_inputs[runner.num_dgrads :])
+        wgrads = (
+            g.clone() if torch.is_tensor(g) else g
+            for g in runner.static_grad_inputs[runner.num_dgrads :]
+        )
 
         return None, None, *dgrads, *wgrads
 
@@ -736,6 +838,16 @@ def __init__(
         self.fp4_runtime_enabled = None
         self.deallocate_pipeline_outputs = False
         self.num_warmup_steps = 0
+        self.use_stream = False
+        self.gtp_remat = False
+        self.fwd_side_streams = []
+        self.bwd_side_streams = []
+        # Populated by create_bwd_graph: GTP params whose main_grad.add_ was captured in THIS
+        # graph.  Used in Graphed.backward's post-replay hook loop to fire DDP hooks only in the
+        # graph whose replay populates main_grad.
+        self.finalized_during_bwd_capture = []
+        # (rs_stream, params) DDP grad-ready hook plan; built in create_bwd_graph.
+        self._gtp_finalize_hook_plan = []
 
         self.grad_enabled = need_backward and torch.is_grad_enabled()
         self.func = super(MegatronModule, self.base_module).__call__ if func is None else func
@@ -760,6 +872,33 @@ def __init__(
             self.fp4_enabled = self.base_module.config.fp4 is not None
             self.fp8_runtime_enabled = None
             self.fp4_runtime_enabled = None
+            self.gtp_remat = self.base_module.config.gtp_weight_remat_size > 1
+
+            if self.gtp_remat:
+                # Ensure internal warmup (inside create_fwd_graph) has >= 2 steps
+                # for GTP: 1st builds chain + tickets, 2nd exercises prefetch path.
+                self.num_warmup_steps = max(self.num_warmup_steps, 2)
+
+                self.use_stream = True
+                self.stream = torch.cuda.Stream()
+                self.fwd_completion_event = torch.cuda.Event(external=True, interprocess=True)
+                self.bwd_completion_event = torch.cuda.Event(external=True, interprocess=True)
+                # Register (chain, group) side streams before the first forward.
+                # Dense for mamba/attn/shared_experts; expert (below) for routed
+                # experts captured when "moe" is in cuda_graph_modules.
+                from megatron.core.parallel_state import (
+                    get_expert_gtp_weight_remat_group,
+                    get_expert_gtp_weight_remat_world_size,
+                    get_gtp_weight_remat_group,
+                )
+
+                self._register_gtp_side_streams(get_gtp_weight_remat_group())
+                # EGTP streams: required so _wait/_sync_side_streams drain EGTP
+                # NCCL into runner_stream before bwd_completion_event fires.
+                if get_expert_gtp_weight_remat_world_size() > 1:
+                    self._register_gtp_side_streams(get_expert_gtp_weight_remat_group())
+                # Registered for finalize_model_grads to wait on (Phase 2 fence).
+                _GTP_RUNNER_STREAMS.append(self.stream)
 
             if self.fp8_enabled:
                 self.fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
@@ -771,6 +910,56 @@ def __init__(
                 self.fp4_recipe = get_fp4_recipe(self.base_module.config)
                 _set_skip_fp8_weight_update_tensor(False)
 
+    def _register_gtp_side_streams(self, group):
+        """Register a GTP (chain, group)'s GRAPHED AG/RS side streams for capture/replay sync: the
+        AG stream on both fwd and bwd, the RS stream on bwd only."""
+        ag = get_ag_stream(GTPChain.GRAPHED.value, group)
+        rs = get_rs_stream(GTPChain.GRAPHED.value, group)
+        self.fwd_side_streams.append(ag)
+        self.bwd_side_streams.append(ag)
+        self.bwd_side_streams.append(rs)
+
+    def _sync_against_side_streams(self, side_streams):
+        """Make registered side streams wait for the current stream.
+        Also injects a dummy kernel into each stream to ensure it is non-empty,
+        which is required for CUDA graph capture (joining an empty captured
+        stream is a CUDA error)."""
+        for s in side_streams:
+            s.wait_stream(torch.cuda.current_stream())
+            with torch.cuda.stream(s):
+                torch.cuda._sleep(1)
+
+    def _wait_side_streams(self, side_streams):
+        """Make the current stream wait for all registered side streams."""
+        for s in side_streams:
+            torch.cuda.current_stream().wait_stream(s)
+
+    def _compute_finalized_during_bwd_capture(self):
+        """Return GTP params whose DDP grad-ready hook fires post-replay
+        of THIS bwd_graph.
+
+        A param's hook must fire in the graph that physically populates its
+        main_grad. Rules, given the cascade walk in wgrad_reduce_scatter
+        finalizes p.next_w on behalf of p:
+          - p.prev_w is None → p is sync-finalized in p's own graph; add p.
+          - p.next_w is not None → p.next_w's main_grad.add_ is captured here
+            via p's cascade; add p.next_w. (For cross-graph chain tails the
+            wait was captured in the producer's Phase 2, but the add lives
+            here regardless, bridged by external rs_event.)
+        """
+        finalized = {}  # id → param
+        for p in self.params_to_backprop:
+            if not getattr(p, 'is_gtp', False):
+                continue
+            if getattr(p, "prev_w", None) is None:
+                for w in getattr(p, "_weights", [p]):
+                    finalized[id(w)] = w
+            next_w = getattr(p, "next_w", None)
+            if next_w is not None:
+                for w in getattr(next_w, "_weights", [next_w]):
+                    finalized[id(w)] = w
+        return list(finalized.values())
+
     def __str__(self):
         return "%s; hid %s" % (
             self.base_module.__class__.__name__,
@@ -836,9 +1025,7 @@ def create_fwd_graph(self, args, kwargs, outputs=None, clone_inputs=True):
             for buf in self.base_module.buffers():
                 buffer_backup.append(buf.clone())
 
-            grad_backup = []
-            for param in self.base_module.parameters():
-                grad_backup.append(param.main_grad.clone() if hasattr(param, "main_grad") else None)
+            grad_backup = _backup_grads_before_capture(self)
 
             saved_fp8_tensors = None
             if self.fp8_enabled:
@@ -967,6 +1154,10 @@ def clone_ten(ten):
                         allow_unused=True,
                     )
 
+                if self.gtp_remat:
+                    wait_async_comms(GTPChain.GRAPHED.value)
+                    self._sync_against_side_streams(self.bwd_side_streams)
+
             _set_warmup_end()
 
             with self.get_quantization_context():
@@ -987,10 +1178,23 @@ def clone_ten(ten):
                 with torch.cuda.graph(
                     self.fwd_graph, pool=self.mempool, capture_error_mode="thread_local"
                 ):
+
+                    self._sync_against_side_streams(self.fwd_side_streams)
+
                     fwd_graph_outputs = self.func(
                         *self.fwd_graph_input_args, **self.fwd_graph_input_kwargs
                     )
 
+                    if self.gtp_remat:
+                        # Forward only issues AG prefetches (no wgrad RS), so drain AG and skip RS.
+                        wait_async_comms(GTPChain.GRAPHED.value, skip_rs=True)
+
+                    if self.fwd_side_streams:
+                        self._wait_side_streams(self.fwd_side_streams)
+
+                    if self.use_stream:
+                        self.fwd_completion_event.record()
+
                 # Unfreeze GC.
                 if FREEZE_GC:
                     gc.unfreeze()
@@ -1037,9 +1241,7 @@ def clone_ten(ten):
             if self.fp8_enabled:
                 restore_fp8_tensors([self.base_module], saved_fp8_tensors)
             # restore cached grads
-            for main_grad_copy, param in zip(grad_backup, self.base_module.parameters()):
-                if main_grad_copy is not None:
-                    param.main_grad.copy_(main_grad_copy)
+            _restore_grads_after_capture(grad_backup)
 
             # restore cached buffers
             for buf_copy, buf in zip(buffer_backup, self.base_module.buffers()):
@@ -1097,6 +1299,9 @@ def create_bwd_graph(self):
             gc.freeze()
 
         with torch.cuda.graph(self.bwd_graph, pool=self.mempool):
+
+            self._sync_against_side_streams(self.bwd_side_streams)
+
             grad_inputs = torch.autograd.grad(
                 outputs=tuple(o for o in self.fwd_graph_output_surface if o.requires_grad),
                 inputs=tuple(i for i in self.fwd_graph_input_surface if i.requires_grad),
@@ -1106,10 +1311,71 @@ def create_bwd_graph(self):
                 allow_unused=True,
             )
 
+            # GTP cross-graph RS overlap, two phases:
+            #   Phase 1 — drain AG, fence runner_stream past ag_stream's tail,
+            #             then record bwd_completion_event so main_stream can
+            #             release the next runner while RS is still in flight.
+            #   Phase 2 — drain RS wait on rs_stream. For cross-graph chain
+            #             tails the wait is captured here, the add in the
+            #             consumer's cascade; for within-graph tails both
+            #             happen here (see wait_async_comms).
+            if self.gtp_remat:
+                # Phase 1: drain AG; fence runner_stream past dense + EGTP AG
+                # so bwd_completion_event records AFTER NCCL_AG completion.
+                wait_async_comms(GTPChain.GRAPHED.value, skip_rs=True)
+                from megatron.core.parallel_state import (
+                    get_expert_gtp_weight_remat_group,
+                    get_expert_gtp_weight_remat_world_size,
+                    get_gtp_weight_remat_group,
+                )
+
+                gtp_group = get_gtp_weight_remat_group()
+                graphed_ag = get_ag_stream(GTPChain.GRAPHED.value, gtp_group)
+                torch.cuda.current_stream().wait_stream(graphed_ag)
+                if get_expert_gtp_weight_remat_world_size() > 1:
+                    egtp_group = get_expert_gtp_weight_remat_group()
+                    egtp_graphed_ag = get_ag_stream(GTPChain.GRAPHED.value, egtp_group)
+                    torch.cuda.current_stream().wait_stream(egtp_graphed_ag)
+
+                # Record completion AFTER AG drain + fence but BEFORE RS drain,
+                # so main_stream can trigger the next runner while RS is still
+                # in flight on rs_stream.
+                self.bwd_completion_event.record()
+
+                # Phase 2: in-graph RS drain + finalize.
+                wait_async_comms(GTPChain.GRAPHED.value, finalize_after_drain=True)
+
+            if self.bwd_side_streams:
+                self._wait_side_streams(self.bwd_side_streams)
+
+            if self.use_stream and not self.gtp_remat:
+                # Non-GTP path: record after the side-stream join.
+                self.bwd_completion_event.record()
+
         # Unfreeze GC.
         if FREEZE_GC:
             gc.unfreeze()
 
+        # See _compute_finalized_during_bwd_capture for what's in this set and why.
+        self.finalized_during_bwd_capture = (
+            self._compute_finalized_during_bwd_capture() if self.gtp_remat else []
+        )
+
+        # Precompute the (rs_stream, params) DDP grad-ready hook plan once — it's
+        # replay-invariant — so Graphed.backward avoids per-replay group lookups.
+        self._gtp_finalize_hook_plan = []
+        if self.gtp_remat and self.finalized_during_bwd_capture:
+            dense_group = parallel_state.get_gtp_weight_remat_group()
+            expert_group = parallel_state.get_expert_gtp_weight_remat_group()
+            params_by_group = defaultdict(list)
+            for param in self.finalized_during_bwd_capture:
+                is_expert = not getattr(param, 'allreduce', True)
+                params_by_group[expert_group if is_expert else dense_group].append(param)
+            self._gtp_finalize_hook_plan = [
+                (get_rs_stream(GTPChain.GRAPHED.value, group), params)
+                for group, params in params_by_group.items()
+            ]
+
         # Constructs a tuple suitable for returning from Graphed.backward:
         # Pads out the actually-needed grads with Nones in gradient slots for inputs
         # that don't require grad
@@ -1519,6 +1785,10 @@ def wrapped_func(*args, eager=False, cache_key=None, **kwargs):
         self.reuse_cudagraphs = self.pg_collection.pp.size() == 1
         if CudaGraphManager.global_mempool is None:
             CudaGraphManager.global_mempool = torch.cuda.graph_pool_handle()
+            # Register the pool so GTP allocates GRAPHED-chain buffers + quantized
+            # storage directly into it (created before the first graphed forward).
+            if HAVE_GTP:
+                set_cuda_graph_mempool(torch.cuda.current_device(), CudaGraphManager.global_mempool)
             # Cudagraph stream capture requires no operations on the default stream prior to the
             # capture, so change to a side stream.
             torch.cuda.set_stream(torch.cuda.Stream())
@@ -1705,7 +1975,7 @@ def __call__(self, megatron_module, args, kwargs, cache_key=None):
 
         self.is_first_microbatch = False
         # If forward only, next replay should be a forward pass as well
-        if is_inference_mode or not torch.is_grad_enabled():
+        if is_inference_mode or not torch.is_grad_enabled() or not runner.fwd_graph_recorded:
             runner.status = _GraphStatus.FWD_READY
         else:
             runner.status = _GraphStatus.BWD_READY
diff --git a/megatron/core/transformer/mlp.py b/megatron/core/transformer/mlp.py
index 1a578151f1e..6328abc5aad 100644
--- a/megatron/core/transformer/mlp.py
+++ b/megatron/core/transformer/mlp.py
@@ -172,6 +172,7 @@ def __init__(
         ffn_hidden_size: Optional[int] = None,
         tp_group: Optional[torch.distributed.ProcessGroup] = None,
         name: str | None = None,
+        gtp_group: Optional[torch.distributed.ProcessGroup] = None,
     ):
         """
         Args:
@@ -224,6 +225,7 @@ def __init__(
             is_expert=is_expert,
             tp_comm_buffer_name="fc1",
             tp_group=tp_group,
+            gtp_group=gtp_group,
             stride=fc1_stride,
             name=(name + ".linear_fc1") if name is not None else None,
         )
@@ -247,6 +249,7 @@ def __init__(
             tp_comm_buffer_name="fc2",
             tp_group=tp_group,
             name=(name + ".linear_fc2") if name is not None else None,
+            gtp_group=gtp_group,
         )
 
     def forward(
@@ -391,10 +394,23 @@ def as_mlp_submodule(
         assert hasattr(
             pg_collection, 'tp'
         ), 'TP process group is required for MLP in TransformerLayer'
+
+        # Forward gtp_group so fc1/fc2 shard their weights (like attention / shared_experts).
+        # Only the non-fused MLP honors GTP; the TE op-fused variants (_make_fused_impl) build
+        # GEMMs straight from the weights without all-gathering shards, so fail fast on that combo.
+        gtp_group = getattr(pg_collection, 'gtp', None)
+        if hasattr(cls, '_make_fused_impl'):
+            assert gtp_group is None or gtp_group.size() == 1, (
+                f"{cls.__name__}: GTP sharding of the dense MLP is not supported with the "
+                "TE fused MLP / GroupedLinear path (_make_fused_impl ignores GTP shards). "
+                "Use the non-fused MLP submodule, or do not enable GTP for dense MLP layers."
+            )
+            gtp_group = None
         return cls(
             config=config,
             submodules=submodules,
             tp_group=pg_collection.tp,
+            gtp_group=gtp_group,
             is_expert=is_expert,
             input_size=input_size,
             ffn_hidden_size=ffn_hidden_size,
diff --git a/megatron/core/transformer/moe/moe_utils.py b/megatron/core/transformer/moe/moe_utils.py
index ca004bd10b1..76b112bf01f 100644
--- a/megatron/core/transformer/moe/moe_utils.py
+++ b/megatron/core/transformer/moe/moe_utils.py
@@ -1387,6 +1387,14 @@ def get_default_pg_collection() -> ProcessGroupCollection:
     pg_collection.cp = parallel_state.get_context_parallel_group()
     pg_collection.expt_tp = parallel_state.get_expert_tensor_parallel_group()
     pg_collection.expt_dp = parallel_state.get_expert_data_parallel_group()
+    # EGTP-excluded expert-DP groups used to stamp expert-weight replica_ids. Must not be
+    # left None (get_pg_rank(None)==0 -> duplicate-writer collision at checkpoint save).
+    pg_collection.expt_dp_no_egtp = parallel_state.get_expert_data_parallel_group(
+        no_gtp=True, check_initialized=False
+    )
+    pg_collection.intra_expt_dp_no_egtp = parallel_state.get_expert_data_parallel_group(
+        no_gtp=True, partial_expert_data_parallel=True, check_initialized=False
+    )
     pg_collection.tp_ep = parallel_state.get_expert_tensor_and_model_parallel_group()
     pg_collection.tp_cp = parallel_state.get_tensor_and_context_parallel_group()
     pg_collection.tp_dp_cp = parallel_state.get_tensor_and_data_parallel_group(
diff --git a/megatron/core/transformer/moe/shared_experts.py b/megatron/core/transformer/moe/shared_experts.py
index 8fc7876bf32..8c057a60a29 100644
--- a/megatron/core/transformer/moe/shared_experts.py
+++ b/megatron/core/transformer/moe/shared_experts.py
@@ -120,7 +120,13 @@ def __init__(
 
         config.ffn_hidden_size = config.moe_shared_expert_intermediate_size
         # TODO(Hepteract): pass pg_collection to MLP after refactoring MLP
-        super().__init__(config=config, submodules=submodules, tp_group=pg_collection.tp, name=name)
+        super().__init__(
+            config=config,
+            submodules=submodules,
+            tp_group=pg_collection.tp,
+            name=name,
+            gtp_group=pg_collection.gtp,
+        )
 
         self.use_shared_expert_gate = gate
         if self.use_shared_expert_gate:
diff --git a/megatron/core/transformer/multi_token_prediction.py b/megatron/core/transformer/multi_token_prediction.py
index b20514ce6a4..ca59e6d622d 100755
--- a/megatron/core/transformer/multi_token_prediction.py
+++ b/megatron/core/transformer/multi_token_prediction.py
@@ -1010,6 +1010,7 @@ def __init__(
             tp_comm_buffer_name="mtp_eh_proj",
             tp_group=pg_collection.tp if pg_collection is not None else None,
             name=(name + ".eh_proj") if name is not None else None,
+            gtp_group=pg_collection.gtp if pg_collection is not None else None,
         )
 
         # Build inner layers: two possible paths
diff --git a/megatron/core/transformer/transformer_config.py b/megatron/core/transformer/transformer_config.py
index be8fca56145..01bff60685f 100644
--- a/megatron/core/transformer/transformer_config.py
+++ b/megatron/core/transformer/transformer_config.py
@@ -925,7 +925,9 @@ class TransformerConfig(ModelParallelConfig):
     more details, see: https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html."""
 
     cuda_graph_warmup_steps: int = 3
-    """Number of warmup steps for CUDA graphs"""
+    """Number of warmup steps for CUDA graphs. Note: GTP (``gtp_weight_remat_size > 1``) forces a
+    minimum of 2 per-graph warmup steps regardless of this value, because the first warmup builds
+    the weight-prefetch chain and the second exercises the prefetch path before capture."""
 
     external_cuda_graph: bool = False
     """DEPRECATED and replaced by cuda_graph_impl.
@@ -2383,6 +2385,27 @@ def _scope_to_str(s):
                             "moe_input_jitter_eps is not supported with graphed moe recomputation."
                         )
 
+                    if (
+                        self.gtp_weight_remat_size > 1
+                        and self.cuda_graph_impl == "local"
+                        and (self.fp8 is not None or self.fp4 is not None)
+                        and self.moe_shared_expert_intermediate_size is not None
+                        and not self.moe_shared_expert_overlap
+                        and (
+                            full_cudagraph
+                            or CudaGraphModule.moe in self.cuda_graph_modules
+                            or CudaGraphModule.moe_router in self.cuda_graph_modules
+                        )
+                    ):
+                        assert "shared_experts" not in self.recompute_modules, (
+                            "GTP + local CUDA graphs that capture shared_experts "
+                            "(moe_router/moe scope) cannot recompute it under fp8/fp4: "
+                            "te_checkpoint requires .backward(), but the local fwd-graph "
+                            "warmup uses .grad(). Drop 'shared_experts' from "
+                            "--recompute-modules (GTP-shard + offload instead), or use "
+                            "--cuda-graph-impl full_iteration."
+                        )
+
             if self.fine_grained_activation_offloading:
                 offload_modules = set(self.offload_modules or [])
                 local_partial_moe_offload = (
diff --git a/megatron/core/transformer/utils.py b/megatron/core/transformer/utils.py
index 2249c79a2bd..82afaea56f0 100644
--- a/megatron/core/transformer/utils.py
+++ b/megatron/core/transformer/utils.py
@@ -132,6 +132,28 @@ def make_sharded_tensors_for_checkpoint(
         tp_group = get_tensor_model_parallel_group_if_none(tp_group)
         dp_cp_group = parallel_state.get_data_parallel_group(with_context_parallel=True)
 
+    # GTP-sharded weights need the GTP axis layered onto the TP/DP offsets. The GTP helper
+    # is a no-op for non-GTP state_dicts, but importing it eagerly would be circular, so
+    # gate on HAVE_GTP and the presence of a GTPShardedParam before delegating.
+    from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+    if HAVE_GTP:
+        from megatron.core.tensor_parallel.gtp import (
+            GTPShardedParam,
+            make_sharded_tensors_for_checkpoint_with_gtp,
+        )
+
+        if any(isinstance(t, GTPShardedParam) for t in state_dict.values()):
+            return make_sharded_tensors_for_checkpoint_with_gtp(
+                state_dict,
+                prefix,
+                tensor_parallel_layers_axis_map,
+                sharded_offsets,
+                extra_state_suffix=extra_state_suffix,
+                tp_group=tp_group,
+                dp_cp_group=dp_cp_group,
+            )
+
     sharded_state_dict = {}
     for layer_name in state_dict.keys():
         tensor = state_dict[layer_name]
diff --git a/megatron/core/utils.py b/megatron/core/utils.py
index 169aebc27f9..274d375955e 100644
--- a/megatron/core/utils.py
+++ b/megatron/core/utils.py
@@ -874,7 +874,10 @@ def check_param_hashes_across_dp_replicas(
     for params, local_param_hashes, all_gather_group in zip(
         [non_expert_params, expert_params],
         [local_non_expert_param_hashes, local_expert_param_hashes],
-        [parallel_state.get_data_parallel_group(), parallel_state.get_expert_data_parallel_group()],
+        [
+            parallel_state.get_data_parallel_group(no_gtp=True),
+            parallel_state.get_expert_data_parallel_group(no_gtp=True),
+        ],
     ):
         # Collect per-parameter hashes across all ranks in group.
         assert len(params) == len(local_param_hashes)
@@ -962,6 +965,37 @@ def make_tp_sharded_tensor_for_checkpoint(
             # FSDP2 shards axis 0 and TP shards some other axis
             new_offsets.append((prepend_axis_num, dp_rank, dp_size))
 
+    # GTP: a GTPShardedParam additionally shards out_features (axis 0) by 1/gtp. Layer that
+    # split onto the TP offset — mirrors make_sharded_tensors_for_checkpoint_with_gtp so direct
+    # callers (e.g. VocabParallelEmbedding, which can't use that wrapper because it needs
+    # allow_shape_mismatch) still save GTP weights with correct global offsets/shape.
+    from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+    if HAVE_GTP:
+        from megatron.core.tensor_parallel.gtp import GTPShardedParam
+
+        if isinstance(tensor, GTPShardedParam):
+            gtp_rank = get_pg_rank(tensor.group)
+            gtp_size = get_pg_size(tensor.group)
+            if tp_axis == 0:
+                # same axis as TP → one composite axis-0 offset
+                new_offsets[0] = (
+                    prepend_axis_num,
+                    tp_rank * gtp_size + gtp_rank,
+                    tp_size * gtp_size,
+                )
+            else:
+                # GTP shards axis 0, TP shards a different axis → add a separate axis-0 offset
+                new_offsets.append((prepend_axis_num, gtp_rank, gtp_size))
+            # GTP peers hold distinct shards (disambiguated by the offset above); the true
+            # replicas are the gtp-EXCLUDED DP group, so elect the writer over that group.
+            dp_replica_id = parallel_state.get_data_parallel_rank(
+                with_context_parallel=True, no_gtp=True
+            )
+            # Saved global is the padded shape when GTP padded out_features for alignment.
+            if getattr(tensor, "pad_length", 0):
+                kwargs.setdefault("allow_shape_mismatch", True)
+
     if replica_id is None:
         replica_id = (0, 0, dp_replica_id)
 
@@ -991,6 +1025,18 @@ def make_sharded_tensor_for_checkpoint(tensor, key, prepend_offsets=(), replica_
             - dp_cp_group: Data parallel + context parallel group
               (default: None, falls back to parallel_state)
     """
+    # Sanity guard.
+    from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+    if HAVE_GTP:
+        from megatron.core.tensor_parallel.gtp import GTPShardedParam
+
+        assert not isinstance(tensor, GTPShardedParam), (
+            f"GTPShardedParam '{key}' reached make_sharded_tensor_for_checkpoint (the replicated "
+            "path); route GTP-sharded weights through make_tp_sharded_tensor_for_checkpoint or "
+            "make_sharded_tensors_for_checkpoint instead."
+        )
+
     # Pop group parameters from kwargs
     tp_group = kwargs.pop('tp_group', None)
     dp_cp_group = kwargs.pop('dp_cp_group', None)
diff --git a/megatron/training/arguments.py b/megatron/training/arguments.py
index f305a5a7668..47083b0e574 100644
--- a/megatron/training/arguments.py
+++ b/megatron/training/arguments.py
@@ -1396,6 +1396,92 @@ def validate_args(args, defaults={}):
         if args.expert_model_parallel_size  > 1 and 'ep_dp' not in args.high_priority_stream_groups:
             args.high_priority_stream_groups.append('ep_dp')
 
+
+    # Derive the internal gtp_weight_remat_size from the user-facing
+    # --tensor-parallel-num-weight-shards. gtp_weight_remat_size has no CLI flag (it is excluded
+    # from argument generation), so it is set here as a fresh attribute on args before it is
+    # consumed below (and in initialize/training, which read args.gtp_weight_remat_size directly).
+    # Mirrors ModelParallelConfig.__post_init__.
+    from megatron.core.model_parallel_config import resolve_tensor_parallel_weight_shards
+    (args.tensor_parallel_num_weight_shards, args.gtp_weight_remat_size) = (
+        resolve_tensor_parallel_weight_shards(
+            args.tensor_model_parallel_size,
+            args.tensor_parallel_num_weight_shards,
+            getattr(args, "gtp_weight_remat_size", 1),
+        )
+    )
+    # Same for the expert layers: derive the internal expert_gtp_weight_remat_size from the
+    # user-facing --expert-tensor-parallel-num-weight-shards (expert_tensor_parallel_size is
+    # defaulted earlier in validate_args). expert_gtp_weight_remat_size has no CLI flag.
+    (args.expert_tensor_parallel_num_weight_shards, args.expert_gtp_weight_remat_size) = (
+        resolve_tensor_parallel_weight_shards(
+            args.expert_tensor_parallel_size,
+            args.expert_tensor_parallel_num_weight_shards,
+            getattr(args, "expert_gtp_weight_remat_size", 1),
+        )
+    )
+
+    if args.gtp_weight_remat_size > 1 or args.expert_gtp_weight_remat_size > 1:
+        gtp_weight_remat_size = args.gtp_weight_remat_size
+        egtp_weight_remat_size = args.expert_gtp_weight_remat_size
+        if get_device_arch_version() >= 10:
+            # Setting GTP communication groups for high priority streams for Blackwell and later
+            # architectures. Assigning high priority to communication streams ensures that
+            # communication kernels are scheduled with higher priority, minimizing the exposed
+            # communication when it is overlapped with other computation kernels.
+            if 'gtp' not in args.high_priority_stream_groups:
+                args.high_priority_stream_groups.append('gtp')
+                warn_rank_0("Setting 'gtp' group for high priority streams.")
+            if egtp_weight_remat_size > 1 and 'expt_gtp' not in args.high_priority_stream_groups:
+                args.high_priority_stream_groups.append('expt_gtp')
+                warn_rank_0("Setting 'expt_gtp' group for high priority streams.")
+
+            # Sanity check for 'CUDA_GRAPHS_USE_NODE_PRIORITY'.
+            if args.cuda_graph_impl != "none":
+                assert os.environ.get('CUDA_GRAPHS_USE_NODE_PRIORITY') == "1", \
+                    'GTP requires CUDA_GRAPHS_USE_NODE_PRIORITY=1 to make sure fine-grained GTP ' \
+                    'comms can be well overlapped with GEMMs when CudaGraph is enabled for ' \
+                    'Blackwell and later architecture.'
+
+        # Sanity check for 'NCCL_PROTO'.
+        if os.environ.get('NCCL_PROTO', '').lower() == "simple":
+            warn_rank_0(
+                "Generally GTP prefers 'NCCL_PROTO=LL128 or LL' while get 'NCCL_PROTO=simple', "
+                "force setting NCCL_PROTO=Simple might introduce bad perf."
+            )
+
+        assert not args.ddp_average_in_collective, (
+            "GTP requires --ddp-average-in-collective off (the default); averaged collectives "
+            "would need per-buffer 1/gtp scaling."
+        )
+
+        assert args.ckpt_format in ('torch', 'torch_dist'), (
+            f"GTP supports only --ckpt-format 'torch' (legacy) or 'torch_dist', got "
+            f"'{args.ckpt_format}'."
+        )
+        assert not (
+            getattr(args, 'dist_ckpt_optim_fully_reshardable', False)
+            and getattr(args, 'distrib_optim_fully_reshardable_mem_efficient', False)
+        ), (
+            "GTP does not support the distributed-optimizer fully-reshardable + "
+            "mem-efficient checkpoint mode. Disable "
+            "--distrib-optim-fully-reshardable-mem-efficient (or "
+            "--dist-ckpt-optim-fully-reshardable)."
+        )
+
+        # Propagate --fp8-param-gather into GTPConfig: enables optimizer-side
+        # FP32->FP8 cast for GTP shards, so the forward skips BF16->FP8.
+        if getattr(args, 'fp8_param_gather', False):
+            from megatron.core.tensor_parallel.gtp import update_gtp_config
+
+            update_gtp_config(fp8_param_gather=True)
+            warn_rank_0(
+                "GTP + --fp8-param-gather: setting "
+                "GTPConfig.fp8_param_gather=True (optimizer step "
+                "pre-quantizes GTP shards, skipping the per-forward "
+                "BF16->FP8 cast)."
+            )
+
     # Disable bias gelu fusion if we are disabling bias altogether
     if not args.add_bias_linear:
         args.bias_gelu_fusion = False
@@ -2109,6 +2195,10 @@ def _add_network_size_args(parser):
         "bias_dropout_fusion",
         "apply_rope_fusion",
         "mamba_training_ssm_states_dtype",
+        # internal/derived: controlled only via --tensor-parallel-num-weight-shards
+        "gtp_weight_remat_size",
+        # internal/derived: controlled only via --expert-tensor-parallel-num-weight-shards
+        "expert_gtp_weight_remat_size",
     ]
     transformer_factory = ArgumentGroupFactory(TransformerConfig, exclude=exclude)
     transformer_group = transformer_factory.build_group(parser, "transformer configuration")
diff --git a/megatron/training/checkpointing.py b/megatron/training/checkpointing.py
index 27b275c3017..9984be081f5 100644
--- a/megatron/training/checkpointing.py
+++ b/megatron/training/checkpointing.py
@@ -833,10 +833,12 @@ def iter_finalize_fn():
                 with open_file(tracker_filename, 'w') as f:
                     f.write("release" if release else str(iteration))
                 tensor_rank_to_print = (tensor_rank if tensor_rank is not None else mpu.get_tensor_model_parallel_rank()) + 1
+                gtp_rank_to_print = mpu.get_gtp_weight_remat_rank() + 1
                 pipeline_rank_to_print = (pipeline_rank if pipeline_rank is not None else mpu.get_pipeline_model_parallel_rank()) + 1
                 print_rank_0(f"  [{datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')}] successfully saved "
                              f"checkpoint from iteration {int(iteration):7d} to {args.save} "
                              f"[ t {tensor_rank_to_print}/{mpu.get_tensor_model_parallel_world_size()}, "
+                             f"gtp {gtp_rank_to_print}/{mpu.get_gtp_weight_remat_world_size()}, "
                              f"p {pipeline_rank_to_print}/{mpu.get_pipeline_model_parallel_world_size()} ]")
                 if args.log_progress and args.async_save:
                     append_to_progress_log(args.save, f'Saved async checkpoint\tIteration: {iteration}',
@@ -2116,8 +2118,11 @@ def load_model_state_dict(module, state_dict, strict: bool):
     _tp_w = get_pg_size(tp_group) if tp_group is not None else mpu.get_tensor_model_parallel_world_size()
     _pp_r = get_pg_rank(pp_group) if pp_group is not None else mpu.get_pipeline_model_parallel_rank()
     _pp_w = get_pg_size(pp_group) if pp_group is not None else mpu.get_pipeline_model_parallel_world_size()
+    _gtp_r = mpu.get_gtp_weight_remat_rank()
+    _gtp_w = mpu.get_gtp_weight_remat_world_size()
     print_rank_0(f'  successfully loaded checkpoint from {load_dir} '
                  f'[ t {_tp_r + 1}/{_tp_w}, '
+                 f'gtp {_gtp_r + 1}/{_gtp_w}, '
                  f'p {_pp_r + 1}/{_pp_w} ] '
                  f'at iteration {iteration}')
 
diff --git a/megatron/training/initialize.py b/megatron/training/initialize.py
index faf64847fb3..8aa882b69e2 100644
--- a/megatron/training/initialize.py
+++ b/megatron/training/initialize.py
@@ -357,12 +357,25 @@ def _initialize_distributed(get_embedding_ranks, get_position_embedding_ranks, s
         if mpu.model_parallel_is_initialized():
             print("model parallel is already initialized")
         else:
+            if args.gtp_weight_remat_size > 1 or args.expert_gtp_weight_remat_size > 1:
+                from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+                assert HAVE_GTP, (
+                    "GTP requires TransformerEngine >= 2.17. "
+                    "Set MEGATRON_GTP_FORCE_ENABLE=1 to bypass for custom TE builds, "
+                    "or set both --gtp-weight-remat-size and "
+                    "--expert-generalized-tensor-parallel-remat-size to 1."
+                )
             mpu.initialize_model_parallel(
                 args.tensor_model_parallel_size,
                 args.pipeline_model_parallel_size,
                 args.virtual_pipeline_model_parallel_size,
                 pipeline_model_parallel_comm_backend=args.pipeline_model_parallel_comm_backend,
                 use_sharp=args.use_sharp,
+                # GTP/EGTP require world_size divisible by TP*PP*CP*GTP (and the expert grid
+                # by ETP*EP*PP*EGTP). Inactive when the remat sizes are 1.
+                gtp_remat_size=args.gtp_weight_remat_size,
+                expert_gtp_remat_size=args.expert_gtp_weight_remat_size,
                 context_parallel_size=args.context_parallel_size,
                 hierarchical_context_parallel_sizes=args.hierarchical_context_parallel_sizes,
                 hybrid_context_parallel=args.hybrid_context_parallel,
diff --git a/megatron/training/models/dist_utils.py b/megatron/training/models/dist_utils.py
index 28dde6d6501..9c9fffa4d63 100644
--- a/megatron/training/models/dist_utils.py
+++ b/megatron/training/models/dist_utils.py
@@ -150,7 +150,7 @@ def _print_num_params(model: list[MegatronModule], pg_collection: ProcessGroupCo
     """Print the number of parameters in the model on rank 0.
 
     Only prints on data parallel rank 0 to avoid duplicate output.
-    Shows parameter count per (tensor parallel, pipeline parallel) rank.
+    Shows parameter count per (tensor parallel, gtp, pipeline parallel) rank.
 
     Args:
         model: List of model modules to count parameters from
@@ -158,8 +158,9 @@ def _print_num_params(model: list[MegatronModule], pg_collection: ProcessGroupCo
     """
     if (pg_collection.dp.rank() == 0) and (pg_collection.cp.rank() == 0):
         print(
-            " > number of parameters on (tensor, pipeline) model parallel rank ({}, {}): {}".format(
+            " > number of parameters on (tensor, gtp, pipeline) model parallel rank ({}, {}, {}): {}".format(
                 pg_collection.tp.rank(),
+                pg_collection.gtp.rank(),
                 pg_collection.pp.rank(),
                 sum([sum([p.nelement() for p in model_module.parameters()]) for model_module in model]),
             ),
diff --git a/megatron/training/training.py b/megatron/training/training.py
index e825853fdb1..13463fee32a 100644
--- a/megatron/training/training.py
+++ b/megatron/training/training.py
@@ -114,6 +114,7 @@
     get_rerun_state_machine,
 )
 from megatron.core.resharding.refit import swap_model_weights
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
 from megatron.core.transformer.cuda_graphs import TECudaGraphHelper
 from megatron.core.transformer.experimental_attention_variant.dsa import DSAIndexerLossLoggingHelper
 from megatron.core.transformer.module import Float16Module
@@ -295,6 +296,23 @@ def print_datetime(string, override_timestamp=None):
     print_rank_0(f'[{string}] datetime: {time_str} ')
 
 
+def reset_gtp_quantize_cache_after_load(model):
+    """Invalidate GTP's per-shard low-precision cache after a checkpoint load.
+
+    GTP keeps a per-shard low-precision cache (``self.quantized``) that survives the
+    in-place writes to ``.data`` performed by DCP load. Reset it so the first forward
+    after resume re-quantizes from the freshly-loaded BF16 weight instead of reusing
+    the stale pre-load cast (which otherwise spikes lm-loss for one iteration before
+    normal training overwrites the cache). No-op when GTP is unavailable.
+    """
+    if not HAVE_GTP:
+        return
+    from megatron.core.tensor_parallel.gtp import reset_gtp_quantize_cache
+
+    for m in model:
+        reset_gtp_quantize_cache(m)
+
+
 def update_seqlen_stats_from_cu_seqlens(cu_seqlens):
     """Add ``sum(L_i)`` and ``sum(L_i ** 2)`` from one micro-batch's REAL ``cu_seqlens``.
 
@@ -1629,8 +1647,11 @@ def wrap_model_chunks_with_ddp(
                 "wrap_model_chunks_with_ddp requires a dp_cp process group to size "
                 "the distributed-optimizer parameter layout"
             )
-            data_parallel_world_size = get_pg_size(layout_pgs.dp_cp)
-            expert_data_parallel_world_size = get_pg_size(getattr(layout_pgs, "expt_dp", None))
+            # Size the layout for the replicate (gtp/egtp-EXCLUDED) DP group the DDP buffer
+            # actually shards over, so DDP can use it directly without recomputing. no_gtp
+            # aliases the regular DP group when GTP is inactive.
+            data_parallel_world_size = get_pg_size(layout_pgs.dp_cp_no_gtp)
+            expert_data_parallel_world_size = get_pg_size(getattr(layout_pgs, "expt_dp_no_egtp", None))
             for i, (chunk, bucket_size) in enumerate(zip(model_chunks, bucket_sizes)):
                 all_params = [p for p in chunk.parameters() if p.requires_grad]
                 per_chunk_layouts[i] = compute_layout(
@@ -1696,6 +1717,20 @@ def get_model(model_provider_func, model_type=ModelType.encoder_or_decoder, wrap
             # For distillation ckpts without ModelOpt state
             args.modelopt_enabled = True
 
+    # Configure GTP padding alignment based on quantization recipe before model construction.
+    if (
+        getattr(args, 'gtp_weight_remat_size', 1) > 1
+        or getattr(args, 'expert_gtp_weight_remat_size', 1) > 1
+    ):
+        from megatron.core.tensor_parallel.gtp import update_gtp_config
+
+        if getattr(args, 'fp4', None) is not None:
+            update_gtp_config(pad_for_alignment=16)
+        elif getattr(args, 'fp8_recipe', None) == 'mxfp8':
+            update_gtp_config(pad_for_alignment=32)
+        elif getattr(args, 'fp8', None) is not None:
+            update_gtp_config(pad_for_alignment=16)
+
     # Build model.
     def build_model():
         if (
@@ -1744,6 +1779,38 @@ def build_model():
     if not isinstance(model, list):
         model = [model]
 
+    # Classify each GTP param into its prefetch chain (GRAPHED vs UNGRAPHED)
+    # from args.cuda_graph_modules + moe_shared_expert_overlap. Must run after
+    # model build, before the first forward (which lazily builds chain links).
+    if (
+        getattr(args, 'gtp_weight_remat_size', 1) > 1
+        or getattr(args, 'expert_gtp_weight_remat_size', 1) > 1
+    ):
+        from megatron.core.tensor_parallel.gtp import (
+            GTP_CONFIG,
+            classify_gtp_chains,
+            reset_gtp_state,
+            set_cuda_graph_modules,
+            tag_gtp_params_with_names,
+        )
+
+        _raw_modules = getattr(args, 'cuda_graph_modules', None) or []
+        _cg_modules = {getattr(s, 'name', str(s)) for s in _raw_modules} if _raw_modules else None
+        _mse_overlap = getattr(args, 'moe_shared_expert_overlap', False)
+        # cuda_graph_impl lets the classifier tell "CG disabled" from "full-iteration /
+        # graph-every-layer" — both have empty cuda_graph_modules.
+        set_cuda_graph_modules(
+            _cg_modules,
+            moe_shared_expert_overlap=_mse_overlap,
+            cuda_graph_impl=getattr(args, 'cuda_graph_impl', 'none'),
+        )
+        # Clear stale process-global chain state so a rebuilt model starts fresh.
+        reset_gtp_state()
+        for model_module in model:
+            tag_gtp_params_with_names(model_module)
+            classify_gtp_chains(model_module)
+        print_rank_0(f"GTP enabled. {GTP_CONFIG}")
+
     # For rare operations like post-training logits saving
     if args.freeze_all_layers:
         for model_module in model:
@@ -1767,9 +1834,10 @@ def build_model():
     )
     if get_pg_rank(pg_collection.dp) == 0 and get_pg_rank(pg_collection.cp) == 0:
         print(
-            ' > number of parameters on (tensor, pipeline) '
-            'model parallel rank ({}, {}): {}'.format(
+            ' > number of parameters on (tensor, gtp, pipeline) '
+            'model parallel rank ({}, {}, {}): {}'.format(
                 get_pg_rank(pg_collection.tp),
+                get_pg_rank(pg_collection.gtp),
                 get_pg_rank(pg_collection.pp),
                 num_parameters,
             ),
@@ -2118,6 +2186,7 @@ def setup_model_and_optimizer(
             and getattr(args, "use_torch_fsdp2", False)
             and args.ckpt_format == "torch_dist",
         )
+        reset_gtp_quantize_cache_after_load(model)
         timers('load-checkpoint').stop(barrier=True)
         timers.log(['load-checkpoint'])
         one_logger and one_logger.log_metrics(
@@ -3259,6 +3328,7 @@ def train(
                     and getattr(args, "use_torch_fsdp2", False)
                     and args.ckpt_format == "torch_dist",
                 )
+            reset_gtp_quantize_cache_after_load(model)
             ref_state_dict = {k: (v.cpu() if v is not None else v) for k, v in model[0].state_dict().items()}
 
             # Reload RL training checkpoint weights
@@ -3274,6 +3344,7 @@ def train(
                     and getattr(args, "use_torch_fsdp2", False)
                     and args.ckpt_format == "torch_dist",
                 )
+            reset_gtp_quantize_cache_after_load(model)
 
             args.no_load_optim = no_load_optim
 
diff --git a/megatron/training/utils/common_utils.py b/megatron/training/utils/common_utils.py
index 316bf598fec..bb1e17d8a8e 100644
--- a/megatron/training/utils/common_utils.py
+++ b/megatron/training/utils/common_utils.py
@@ -48,6 +48,35 @@
 from megatron.training import get_adlr_autoresume, get_args, get_timers
 
 
+
+def _compute_norm_2(params_list):
+    """Compute squared L2 norm of a list of tensors. Returns a CUDA scalar."""
+    if len(params_list) > 0:
+        dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device='cuda')
+        norm, _ = multi_tensor_applier(
+            multi_tensor_l2norm, dummy_overflow_buf, [params_list], False,
+        )
+        return norm * norm
+    return torch.zeros((1,), dtype=torch.float32, device='cuda')
+
+
+def _get_param_data(param, force_create_fp32_copy, bf16):
+    """Extract the appropriate data tensor from a param for norm computation.
+
+    Returns (data_tensor, is_sharded) where is_sharded indicates the param has
+    a sharded main_param from the distributed optimizer.
+    """
+    if bf16:
+        if not force_create_fp32_copy and hasattr(param, 'main_param'):
+            if getattr(param, 'main_param_sharded', False):
+                if param.main_param is not None:
+                    return param.main_param, True
+                return None, True
+            return param.main_param, False
+        return param.data.float(), False
+    return param.data, False
+
+
 def calc_params_l2_norm(model, force_create_fp32_copy=False):
     """Calculate l2 norm of parameters"""
     args = get_args()
@@ -70,129 +99,110 @@ def calc_params_l2_norm(model, force_create_fp32_copy=False):
 
         return calc_dtensor_params_l2_norm(params)
 
-    # Seperate moe and dense params
-    params_data = []
-    moe_params_data = []
-    sharded_params_data = []
-    data_parallel_group = None
+    # 8 buckets: 4 categories × (non-sharded, sharded optimizer main_param).
+    # Each category needs different reduction groups.
+    params_data = []                # Dense, non-sharded
+    sharded_params_data = []        # Dense, sharded → reduce over dp_cp
+    gtp_params_data = []            # GTP, non-sharded
+    gtp_sharded_params_data = []    # GTP, sharded → reduce over dp_cp_no_gtp
+    moe_params_data = []            # MoE, non-sharded
+    moe_sharded_params_data = []    # MoE, sharded → reduce over expert_dp
+    moe_gtp_params_data = []        # MoE-GTP, non-sharded
+    moe_gtp_sharded_params_data = []  # MoE-GTP, sharded → reduce over expert_dp_no_gtp
+
+    gtp_rank = mpu.get_gtp_weight_remat_rank()
+    egtp_rank = mpu.get_expert_gtp_weight_remat_rank()
 
     for model_chunk in model:
         for param in model_chunk.parameters():
-            data_parallel_group = get_data_parallel_group_if_dtensor(param, data_parallel_group)
-            is_not_tp_duplicate = param_is_not_tensor_parallel_duplicate(param)
-            if not is_not_tp_duplicate:
+            is_gtp = getattr(param, 'is_gtp', False)
+
+            # Filter TP duplicates. GTP params are always unique across TP ranks
+            # so skip this check for them.
+            if not is_gtp and not param_is_not_tensor_parallel_duplicate(param):
                 continue
-            assert is_not_tp_duplicate
-            if not getattr(param, 'allreduce', True):
+            is_expert = not getattr(param, 'allreduce', True)
+
+            # Filter GTP duplicates: non-GTP params are replicated across GTP ranks.
+            if is_expert:
+                if not is_gtp and egtp_rank != 0:
+                    continue
+            else:
+                if not is_gtp and gtp_rank != 0:
+                    continue
+
+            # Route to the correct bucket.
+            if is_expert:
                 assert param_is_not_shared(param)
                 param = to_local_if_dtensor(param)
-                if args.bf16:
-                    if not force_create_fp32_copy and hasattr(param, 'main_param'):
-                        if getattr(param, 'main_param_sharded', False):
-                            if param.main_param is not None:
-                                sharded_params_data.append(param.main_param)
-                        else:
-                            moe_params_data.append(param.main_param)
-                    else:
-                        # Fallback to original logic of making a fp32 copy of the
-                        # parameter if `.main_param` attribute is not available.
-                        moe_params_data.append(param.data.float())
+                data, is_sharded = _get_param_data(param, force_create_fp32_copy, args.bf16)
+                if data is None:
+                    continue
+                if is_gtp:
+                    (moe_gtp_sharded_params_data if is_sharded else moe_gtp_params_data).append(data)
                 else:
-                    moe_params_data.append(param.data)
+                    (moe_sharded_params_data if is_sharded else moe_params_data).append(data)
             else:
                 if param_is_not_shared(param):
                     param = to_local_if_dtensor(param)
-                    if args.bf16:
-                        if not force_create_fp32_copy and hasattr(param, 'main_param'):
-                            if getattr(param, 'main_param_sharded', False):
-                                if param.main_param is not None:
-                                    sharded_params_data.append(param.main_param)
-                            else:
-                                params_data.append(param.main_param)
-                        else:
-                            # Fallback to original logic of making a fp32 copy of the
-                            # parameter if `.main_param` attribute is not available.
-                            params_data.append(param.data.float())
+                    data, is_sharded = _get_param_data(param, force_create_fp32_copy, args.bf16)
+                    if data is None:
+                        continue
+                    if is_gtp:
+                        (gtp_sharded_params_data if is_sharded else gtp_params_data).append(data)
                     else:
-                        params_data.append(param.data)
-
-    # Calculate norm.
-    dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device='cuda')
-    if len(params_data) > 0:
-        norm, _ = multi_tensor_applier(
-            multi_tensor_l2norm, dummy_overflow_buf, [params_data], False  # no per-parameter norm.
-        )
-        norm_2 = norm * norm
-    else:
-        norm_2 = torch.zeros((1,), dtype=torch.float32, device='cuda')
-
-    if data_parallel_group is not None:
-        torch.distributed.all_reduce(
-            norm_2, op=torch.distributed.ReduceOp.SUM, group=data_parallel_group
-        )
-
-    # Add norm contribution from params with sharded main_params. These norms need to be
-    # accumulated across the DP group since the main parameters are sharded because
-    # of distributed optimizer.
-    if len(sharded_params_data) > 0:
-        dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device='cuda')
-        sharded_norm, _ = multi_tensor_applier(
-            multi_tensor_l2norm,
-            dummy_overflow_buf,
-            [sharded_params_data],
-            False,  # no per-parameter norm.
-        )
-        sharded_norm_2 = sharded_norm * sharded_norm
-    else:
-        sharded_norm_2 = torch.zeros((1,), dtype=torch.float32, device='cuda')
-    # Sum over all DP groups, including CP since distributed optimizer state is
-    # sharded jointly over DP+CP.
-    torch.distributed.all_reduce(
-        sharded_norm_2,
-        op=torch.distributed.ReduceOp.SUM,
-        group=mpu.get_data_parallel_group(with_context_parallel=True)
+                        (sharded_params_data if is_sharded else params_data).append(data)
+
+    # --- Compute local norm^2 for each bucket ---
+    params_norm_2 = _compute_norm_2(params_data)
+    sharded_norm_2 = _compute_norm_2(sharded_params_data)
+    gtp_norm_2 = _compute_norm_2(gtp_params_data)
+    gtp_sharded_norm_2 = _compute_norm_2(gtp_sharded_params_data)
+    moe_norm_2 = _compute_norm_2(moe_params_data)
+    moe_sharded_norm_2 = _compute_norm_2(moe_sharded_params_data)
+    moe_gtp_norm_2 = _compute_norm_2(moe_gtp_params_data)
+    moe_gtp_sharded_norm_2 = _compute_norm_2(moe_gtp_sharded_params_data)
+
+    def _sum_reduce(tensor, group):
+        torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM, group=group)
+
+    # --- Sharded optimizer DP reductions (each category uses its own group) ---
+    # Reduce over the gtp-EXCLUDED replicate group: the model-parallel reduce below already
+    # spans the gtp axis, so a gtp-inclusive group here would over-count by gtp. No-op for
+    # non-GTP runs (the no_gtp group aliases the regular DP group).
+    _sum_reduce(
+        sharded_norm_2, mpu.get_data_parallel_group(with_context_parallel=True, no_gtp=True)
     )
-    norm_2 += sharded_norm_2
-
-    # Add norm contribution from expert layers in MoEs.
-    if len(moe_params_data) > 0:
-        moe_norm, _ = multi_tensor_applier(
-            multi_tensor_l2norm,
-            dummy_overflow_buf,
-            [moe_params_data],
-            False,  # no per-parameter norm.
-        )
-        moe_norm_2 = moe_norm * moe_norm
+    _sum_reduce(
+        gtp_sharded_norm_2, mpu.get_data_parallel_group(with_context_parallel=True, no_gtp=True)
+    )
+    _sum_reduce(moe_sharded_norm_2, mpu.get_expert_data_parallel_group())
+    _sum_reduce(moe_gtp_sharded_norm_2, mpu.get_expert_data_parallel_group(no_gtp=True))
 
-    # Account for MoE norm even if current rank doesn't have any expert params to prevent
-    # hang in models with un-even numbers of MoE layers.
-    # See details in https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/issues/409
-    else:
-        moe_norm_2 = torch.zeros_like(norm_2)
+    # --- Combine dense + GTP norms ---
+    # model_parallel group = TP×GTP×PP, so GTP reduction is implicit.
+    norm_2 = params_norm_2 + sharded_norm_2 + gtp_norm_2 + gtp_sharded_norm_2
 
-    # Reduce norm across model parallel groups (dense and expert).
-    # Dense params should sum across all model-parallel GPUs (tensor + pipeline).
+    # --- Combine MoE + MoE-GTP norms ---
+    # expert_model_parallel = TP×EP×PP (does NOT include EGTP), so we need
+    # an explicit EGTP reduction for MoE-GTP before the model-parallel reduce.
+    moe_gtp_combined_norm_2 = moe_gtp_norm_2 + moe_gtp_sharded_norm_2
+    _sum_reduce(moe_gtp_combined_norm_2, mpu.get_expert_gtp_weight_remat_group())
+    moe_total_norm_2 = moe_norm_2 + moe_sharded_norm_2 + moe_gtp_combined_norm_2
+
+    # --- Model-parallel reductions ---
     dense_reduce_group = mpu.get_model_parallel_group()
-    ranks_in_dense_reduce_group = torch.distributed.get_process_group_ranks(dense_reduce_group)
-    # Expert params should sum across all model-parallel GPUs (expert + tensor + pipeline).
     expert_reduce_group = mpu.get_expert_tensor_model_pipeline_parallel_group()
+    ranks_in_dense_reduce_group = torch.distributed.get_process_group_ranks(dense_reduce_group)
     ranks_in_expert_reduce_group = torch.distributed.get_process_group_ranks(expert_reduce_group)
 
-    # If dense and expert reduce groups are the same, sum then reduce.
     if ranks_in_dense_reduce_group == ranks_in_expert_reduce_group:
-        norm_2 += moe_norm_2
-        torch.distributed.all_reduce(
-            norm_2, op=torch.distributed.ReduceOp.SUM, group=dense_reduce_group
-        )
-    # If dense and expert reduce groups are different, reduce then sum.
+        norm_2 += moe_total_norm_2
+        _sum_reduce(norm_2, dense_reduce_group)
     else:
-        torch.distributed.all_reduce(
-            norm_2, op=torch.distributed.ReduceOp.SUM, group=dense_reduce_group
-        )
-        torch.distributed.all_reduce(
-            moe_norm_2, op=torch.distributed.ReduceOp.SUM, group=expert_reduce_group
-        )
-        norm_2 += moe_norm_2
+        _sum_reduce(norm_2, dense_reduce_group)
+        _sum_reduce(moe_total_norm_2, expert_reduce_group)
+        norm_2 += moe_total_norm_2
 
     return norm_2.item() ** 0.5
 
diff --git a/tests/unit_tests/generalized_tensor_parallel/__init__.py b/tests/unit_tests/generalized_tensor_parallel/__init__.py
new file mode 100644
index 00000000000..b5dff7b5663
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/__init__.py
@@ -0,0 +1 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
diff --git a/tests/unit_tests/generalized_tensor_parallel/gtp_test_utils.py b/tests/unit_tests/generalized_tensor_parallel/gtp_test_utils.py
new file mode 100644
index 00000000000..7af4c4c83bb
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/gtp_test_utils.py
@@ -0,0 +1,101 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Shared fixtures and helpers for all GTP unit tests.
+"""
+
+import pytest
+import torch
+import transformer_engine.pytorch as te
+from transformer_engine.pytorch import is_mxfp8_available, is_nvfp4_available
+from transformer_engine.pytorch.quantization import FP8GlobalStateManager
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from tests.unit_tests.test_utilities import Utils
+
+# ---------------------------------------------------------------------------
+# Fixtures (import into each test module so pytest discovers them)
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture(scope="module", autouse=True)
+def _torchrun_dist_init():
+    """Initialize the torchrun-managed dist group once per module."""
+    Utils.initialize_model_parallel()
+    yield
+    Utils.destroy_model_parallel()
+
+
+@pytest.fixture(autouse=True)
+def reset_fp8_state():
+    yield
+    FP8GlobalStateManager.reset()
+
+
+@pytest.fixture(autouse=True)
+def reset_gtp_globals():
+    """Reset GTP mutable class-level state between tests."""
+    yield
+    GTPShardedParam._chain_state = {}
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _run_distributed(fn, required_world_size: int, *args) -> None:
+    """Run ``fn(rank, world_size, port, *args)`` on every torchrun rank.
+
+    ``port`` is unused (dist already initialized by torchrun) but kept so
+    worker signatures don't need editing.
+    """
+    actual_world_size = torch.distributed.get_world_size()
+    if actual_world_size != required_world_size:
+        pytest.skip(
+            f"Requires world_size={required_world_size}, "
+            f"got {actual_world_size} (launch with torchrun --nproc-per-node={required_world_size})"
+        )
+    fn(torch.distributed.get_rank(), actual_world_size, None, *args)
+
+
+def _requires_multi_gpu(n: int = 4):
+    if torch.cuda.device_count() < n:
+        pytest.skip(f"Requires at least {n} CUDA devices")
+
+
+def _requires_mxfp8():
+    available, reason = is_mxfp8_available(return_reason=True)
+    if not available:
+        pytest.skip(f"MXFP8 not available: {reason}")
+
+
+def _requires_nvfp4():
+    if not is_nvfp4_available():
+        pytest.skip("NVFP4 not available (requires compute capability >= 10.0)")
+
+
+def _make_gtp_linear(in_f, out_f, gtp_group, dtype=torch.bfloat16, **kwargs):
+    """Construct a bias-free GTP-sharded te.Linear on CUDA."""
+    return te.Linear(
+        in_features=in_f,
+        out_features=out_f,
+        bias=False,
+        params_dtype=dtype,
+        device="cuda",
+        gtp_group=gtp_group,
+        **kwargs,
+    )
+
+
+def _make_gtp_grouped_linear(num_gemms, in_f, out_f, gtp_group, dtype=torch.bfloat16, **kwargs):
+    """Construct a bias-free GTP-sharded te.GroupedLinear on CUDA."""
+    return te.GroupedLinear(
+        num_gemms=num_gemms,
+        in_features=in_f,
+        out_features=out_f,
+        bias=False,
+        params_dtype=dtype,
+        device="cuda",
+        gtp_group=gtp_group,
+        **kwargs,
+    )
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_attention_gtp.py b/tests/unit_tests/generalized_tensor_parallel/test_attention_gtp.py
new file mode 100644
index 00000000000..8b983bef032
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_attention_gtp.py
@@ -0,0 +1,248 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Integration tests for GTP + Attention (TransformerLayer) correctness.
+
+Test groups
+-----------
+TestAttentionGTPCorrectness  - GTP TransformerLayer loss trajectory matches baseline (no-GTP)
+                               over 10 training steps using MXFP8 and Nemotron3-Super proxy
+                               hyperparameters.
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+from transformer_engine.pytorch import fp8_autocast
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (
+    _requires_mxfp8,
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+# ---------------------------------------------------------------------------
+# Attention GTP correctness: per-step loss trajectory baseline vs GTP=4
+# ---------------------------------------------------------------------------
+
+
+def _worker_attention_gtp_correctness(rank, world_size, port):
+    """Verify GTP TransformerLayer produces the same per-step loss as a no-GTP baseline.
+
+    Phase 1 — GTP=1, DP=4:
+        All 4 ranks hold the full model and process identical inputs.  Gradients
+        are identical across ranks (no all-reduce needed).  Weight update:
+            param.data -= lr * param.grad
+
+    Phase 2 — GTP=4, DP=1:
+        All linear weights (QKV proj, output proj, MLP fc1/fc2) sharded across
+        4 ranks.  After backward, wgrad reduce-scatter sums each shard's wgrad:
+            main_grad[rank_i] = gtp_size * dW[shard_i]
+        The optimizer divides by gtp_size to recover the per-element gradient:
+            param.data -= (lr / gtp_size) * param.main_grad
+
+    Both phases use identical initial weights (synced from rank 0 in Phase 1,
+    restored as shards in Phase 2) and identical step-by-step inputs.
+
+    Nemotron3-Super proxy hyperparameters:
+        hidden=4096, num_heads=32 (head_dim=128), ffn_hidden_size=16384 (=4xhidden)
+    MXFP8 alignment with GTP=4:
+        QKV shard: 3x4096/4=3072, 3072%32=0 ✓; proj shard: 4096/4=1024, 1024%32=0 ✓
+        fc1 shard: 16384/4=4096, 4096%32=0 ✓; fc2 shard: 4096/4=1024, 1024%32=0 ✓
+    """
+    from transformer_engine.common.recipe import MXFP8BlockScaling
+    from transformer_engine.pytorch.quantization import FP8GlobalStateManager
+
+    from megatron.core import parallel_state as ps
+    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    HIDDEN = 4096
+    NUM_HEADS = 32  # head_dim = HIDDEN / NUM_HEADS = 128
+    FFN_HIDDEN = 16384  # = 4 x HIDDEN (default GPT FFN ratio)
+    NUM_LAYERS = 2
+    SEQ = 32
+    BATCH = 1
+    LR = 0.01
+    STEPS = 10
+    dtype = torch.bfloat16
+    recipe = MXFP8BlockScaling()
+
+    def make_config():
+        return TransformerConfig(
+            num_attention_heads=NUM_HEADS,
+            num_layers=NUM_LAYERS,
+            hidden_size=HIDDEN,
+            ffn_hidden_size=FFN_HIDDEN,
+            add_bias_linear=False,
+            params_dtype=dtype,
+            hidden_dropout=0.0,
+            attention_dropout=0.0,
+            bias_dropout_fusion=False,
+            fp8='e4m3',
+            tensor_model_parallel_size=1,
+            pipeline_model_parallel_size=1,
+        )
+
+    def make_transformer_stack(config, pg_collection):
+        spec = get_gpt_layer_with_transformer_engine_spec()
+        return torch.nn.ModuleList(
+            [
+                spec.module(
+                    config, spec.submodules, layer_number=i + 1, pg_collection=pg_collection
+                )
+                for i in range(NUM_LAYERS)
+            ]
+        )
+
+    def run_step(layers, x):
+        with fp8_autocast(enabled=True, fp8_recipe=recipe):
+            for layer in layers:
+                x, _ = layer(x, attention_mask=None)
+        return x.mean()
+
+    # -------------------------------------------------------------------------
+    # Phase 1: Baseline — GTP=1 (DP=4)
+    # -------------------------------------------------------------------------
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=1
+    )
+    model_parallel_cuda_manual_seed(42)
+
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = make_config()
+    layers = make_transformer_stack(config, pg_collection)
+    for layer in layers:
+        layer.cuda()
+
+    # Verify baseline has no GTP sharding (gtp_remat_size=1 should leave plain parameters).
+    assert not any(
+        isinstance(p, GTPShardedParam) for p in layers.parameters()
+    ), "Baseline GTP=1 stack should have no GTPShardedParam"
+
+    # Synchronize weights from rank 0 across all DP ranks.
+    for p in layers.parameters():
+        dist.broadcast(p.data, src=0)
+
+    # Save initial weights; will be used to initialize the GTP model identically.
+    saved_weights = {n: p.data.clone() for n, p in layers.named_parameters()}
+
+    baseline_losses = []
+    for step in range(STEPS):
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+
+        loss = run_step(layers, x)
+        if rank == 0:
+            baseline_losses.append(loss.item())
+
+        loss.backward()
+        with torch.no_grad():
+            for p in layers.parameters():
+                if p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+    FP8GlobalStateManager.reset()
+
+    # -------------------------------------------------------------------------
+    # Phase 2: GTP=4 (DP=1)
+    # -------------------------------------------------------------------------
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=4
+    )
+    model_parallel_cuda_manual_seed(42)
+
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = make_config()
+    layers_gtp = make_transformer_stack(config, pg_collection)
+    for layer in layers_gtp:
+        layer.cuda()
+
+    gtp_group = ps.get_gtp_weight_remat_group()
+    gtp_size = gtp_group.size()
+    gtp_rank = gtp_group.rank()
+
+    # Verify GTP is truly active: linear weights must be GTPShardedParam instances.
+    gtp_params = [p for p in layers_gtp.parameters() if isinstance(p, GTPShardedParam)]
+    assert (
+        len(gtp_params) > 0
+    ), "GTP is not active: no GTPShardedParam found in GTP=4 transformer stack"
+
+    # Restore initial weights: GTP params get the matching shard, others get the full tensor.
+    for name, p in layers_gtp.named_parameters():
+        full = saved_weights[name]
+        if isinstance(p, GTPShardedParam):
+            shard_size = p.shape[0]
+            p.data.copy_(full[gtp_rank * shard_size : (gtp_rank + 1) * shard_size])
+        else:
+            p.data.copy_(full)
+
+    # Pre-allocate main_grad for GTP params (required before the first backward).
+    for p in layers_gtp.parameters():
+        if isinstance(p, GTPShardedParam):
+            p.main_grad = torch.zeros(p.shape, dtype=dtype, device='cuda')
+
+    gtp_losses = []
+    for step in range(STEPS):
+        for p in layers_gtp.parameters():
+            if isinstance(p, GTPShardedParam):
+                p.main_grad.zero_()
+
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+
+        loss = run_step(layers_gtp, x)
+        if rank == 0:
+            gtp_losses.append(loss.item())
+
+        loss.backward()
+
+        # After RS, main_grad = gtp_size * dW_shard.  Divide by gtp_size to match baseline.
+        with torch.no_grad():
+            for p in layers_gtp.parameters():
+                if isinstance(p, GTPShardedParam):
+                    p.data.sub_((LR / gtp_size) * p.main_grad)
+                elif p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # -------------------------------------------------------------------------
+    # Compare per-step loss trajectories on rank 0
+    # -------------------------------------------------------------------------
+    if rank == 0:
+        assert len(baseline_losses) == STEPS
+        assert len(gtp_losses) == STEPS
+        for step, (lb, lg) in enumerate(zip(baseline_losses, gtp_losses)):
+            print(f"Step {step:2d}: baseline={lb:.6f}  gtp={lg:.6f}", flush=True)
+        torch.testing.assert_close(
+            torch.tensor(gtp_losses), torch.tensor(baseline_losses), atol=1e-5, rtol=1e-5
+        )
+
+
+class TestAttentionGTPCorrectness:
+    def test_attention_gtp_loss_trajectory_matches_baseline(self):
+        """GTP TransformerLayer per-step losses must match no-GTP baseline (atol=1e-5, rtol=1e-5; MXFP8, Nemotron3-Super proxy)."""
+        _requires_mxfp8()
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires at least 4 CUDA devices")
+        _run_distributed(_worker_attention_gtp_correctness, 4)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_gtp.py b/tests/unit_tests/generalized_tensor_parallel/test_gtp.py
new file mode 100644
index 00000000000..3ebff6b8e56
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_gtp.py
@@ -0,0 +1,1489 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Unit tests for Generalized Tensor Parallelism (GTP).
+
+Test groups
+-----------
+1.  TestGTPWeightState           - state-machine transitions (single-process)
+2.  TestGTPWeightCache           - coat-check buffer pool + reserve/get/release semantics (single-process)
+3.  TestGTPSharding              - wrap_module_params_gtp: shard content + padding (multi-GPU)
+4.  TestWrapModuleParams         - wrap_module_params_gtp: param replacement + weight_list (multi-GPU)
+5.  TestLinearGTP                - Linear forward/backward numerical correctness (multi-GPU)
+6.  TestLayerNormLinearGTP       - LayerNormLinear forward/backward smoke test (multi-GPU)
+7.  TestGroupedLinearGTP         - GroupedLinear forward/backward smoke test (multi-GPU)
+8.  TestGTPPrefetchChain         - linked-list next_w/prev_w wiring (multi-GPU)
+9.  TestGTPWgradRS               - wgrad reduce-scatter shape + multi-layer deferred path (multi-GPU)
+10. TestGTPMicrobatches          - output consistency across microbatches (multi-GPU)
+11. TestNVFP4LinearGTP           - Linear + NVFP4 recipe: quantized shard setup, fwd/bwd (multi-GPU)
+12. TestNVFP4GroupedLinearGTP    - GroupedLinear + NVFP4 recipe: coalesced AG + fwd/bwd (multi-GPU)
+13. TestMXFP8LinearGTP           - Linear + MXFP8 recipe: quantized shard setup, fwd/bwd, padding (multi-GPU)
+14. TestGTPConfig                - update_gtp_config: valid/invalid keys (single-process)
+15. TestGTPShardedParamProperties - shape computations, get_padded_shard, _strip_padding (single-process)
+16. TestGTPCacheKey              - _get_cache_key: expert vs non-expert, fwd vs bwd (single-process)
+17. TestTagGTPParamsWithNames    - _debug_name population on GTPShardedParam (single-process)
+18. TestGTPGroupSizeOne          - wrap_module_params_gtp no-op when gtp_group.size()==1 (single-process)
+19. TestGTPPrefetchDisabled      - weight_prefetch=False: single-pass forward still works (multi-GPU)
+20. TestFuseWgradAccumulation    - fuse_wgrad_accumulation=True: wgrad→main_grad (multi-GPU)
+21. TestGTPGradAccumHook         - main_grad updated after reduce-scatter backward (multi-GPU)
+22. TestWaitAsyncCommsFallback   - wait_async_comms(finalize_after_drain=True) inline-accumulation fallback when _wgrad_rs_handle is None (single-process)
+23. TestGTPDDPBucketAlignment    - GTP and regular DDP buffer bucket ends padded for dist-opt alignment (multi-GPU)
+24. TestGTPDDPGradReadyWiring    - GTP params drive DDP grad-ready via the manual hook after wgrad add, not autograd (multi-GPU)
+
+Multi-GPU tests skip when ``torch.distributed.get_world_size()`` doesn't match the required
+world size (4 for everything in this file).
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+import transformer_engine.pytorch as te
+from transformer_engine.common.recipe import NVFP4BlockScaling
+from transformer_engine.pytorch import fp8_autocast
+from transformer_engine.pytorch.quantized_tensor import QuantizedTensor
+
+import megatron.core.tensor_parallel.generalized_tensor_parallelism as gtp_module
+from megatron.core.tensor_parallel.generalized_tensor_parallelism import (
+    GTPWeightCache,
+    GTPWeightState,
+)
+from megatron.core.tensor_parallel.gtp import GTPShardedParam, wrap_module_params_gtp
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (
+    _make_gtp_grouped_linear,
+    _make_gtp_linear,
+    _requires_multi_gpu,
+    _requires_mxfp8,
+    _requires_nvfp4,
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+
+class _FakeGroup:
+    """Minimal mock for a dist process group — used in single-process unit tests."""
+
+    def __init__(self, size=1, rank=0):
+        self._size = size
+        self._rank = rank
+
+    def size(self):
+        return self._size
+
+    def rank(self):
+        return self._rank
+
+
+# ---------------------------------------------------------------------------
+# 1. GTPWeightState - state-machine transition tests
+# ---------------------------------------------------------------------------
+
+
+class TestGTPWeightState:
+
+    @staticmethod
+    def _param():
+        return GTPShardedParam(torch.zeros(4, 4))
+
+    def test_full_cycle(self):
+        p = self._param()
+        assert p.state == GTPWeightState.NONE
+        p._set_state(GTPWeightState.ASYNC_WAIT)
+        p._set_state(GTPWeightState.DATA_READY)
+        p._set_state(GTPWeightState.NONE)
+        assert p.state == GTPWeightState.NONE
+
+    def test_sync_path_cycle(self):
+        """NONE → DATA_READY_SYNC → NONE (sync all-gather path)."""
+        p = self._param()
+        p._set_state(GTPWeightState.DATA_READY_SYNC)
+        p._set_state(GTPWeightState.NONE)
+        assert p.state == GTPWeightState.NONE
+
+    def test_rs_state_full_cycle(self):
+        """RS state machine: NONE → ASYNC_WAIT → DATA_READY → NONE."""
+        p = self._param()
+        assert p.rs_state == GTPWeightState.NONE
+        p._set_rs_state(GTPWeightState.ASYNC_WAIT)
+        p._set_rs_state(GTPWeightState.DATA_READY)
+        p._set_rs_state(GTPWeightState.NONE)
+        assert p.rs_state == GTPWeightState.NONE
+
+
+# ---------------------------------------------------------------------------
+# 2. GTPWeightCache - coat-check buffer pool tests
+# ---------------------------------------------------------------------------
+
+
+class TestGTPWeightCache:
+
+    def _param(self, shape=(8, 4), gtp_size=2):
+        p = GTPShardedParam(torch.zeros(*shape))
+        p.group = _FakeGroup(size=gtp_size)
+        p.expert_idx = None
+        p.pad_length = 0
+        p._quantizer = None
+        return p
+
+    def test_reserve_returns_ticket(self):
+        cache = GTPWeightCache()
+        p = self._param()
+        ticket = cache.reserve(p, torch.bfloat16, fwd=True)
+        assert isinstance(ticket, int)
+
+    def test_reserve_get_roundtrip(self):
+        cache = GTPWeightCache()
+        p = self._param()
+        ticket = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf = cache.get(ticket)
+        assert buf is not None
+        # get() returns same buf on second call (buf cached in slot)
+        buf2 = cache.get(ticket)
+        assert buf2 is buf
+
+    def test_buffer_reused_after_release(self):
+        cache = GTPWeightCache()
+        p = self._param()
+        t1 = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf1 = cache.get(t1)
+        cache.release(t1)
+        # Reserve a new ticket, buf should come from pool
+        t2 = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf2 = cache.get(t2)
+        assert buf1 is buf2, "Buffer should be reused from pool after release"
+        cache.release(t2)
+
+    def test_two_simultaneous_reserves_are_distinct(self):
+        cache = GTPWeightCache()
+        p = self._param()
+        t1 = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf1 = cache.get(t1)
+        t2 = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf2 = cache.get(t2)
+        assert buf1 is not buf2, "Concurrent reserves must get distinct buffers"
+
+    def test_tickets_are_unique(self):
+        """Each reserve() call returns a new unique ticket."""
+        cache = GTPWeightCache()
+        p = self._param()
+        t1 = cache.reserve(p, torch.bfloat16, fwd=True)
+        t2 = cache.reserve(p, torch.bfloat16, fwd=True)
+        assert t1 != t2, "Each reserve() must return a unique ticket"
+
+    def test_invalid_ticket_raises(self):
+        cache = GTPWeightCache()
+        with pytest.raises(KeyError):
+            cache.get(9999)
+
+    def test_different_shapes_use_distinct_pool_slots(self):
+        cache = GTPWeightCache()
+        p1 = self._param(shape=(8, 4))
+        p2 = self._param(shape=(16, 4))
+        t1 = cache.reserve(p1, torch.bfloat16, fwd=True)
+        buf1 = cache.get(t1)
+        t2 = cache.reserve(p2, torch.bfloat16, fwd=True)
+        buf2 = cache.get(t2)
+        assert buf1.shape != buf2.shape
+        cache.release(t1)
+        cache.release(t2)
+
+    def test_without_release_pool_stays_empty(self):
+        """Without release(), subsequent reserves allocate fresh buffers."""
+        cache = GTPWeightCache()
+        p = self._param()
+        t1 = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf1 = cache.get(t1)
+        # Do NOT release t1 — pool stays empty
+        t2 = cache.reserve(p, torch.bfloat16, fwd=True)
+        buf2 = cache.get(t2)
+        assert buf2 is not buf1, "Without release, a fresh buffer must be allocated"
+
+    def test_release_invalid_ticket_raises(self):
+        cache = GTPWeightCache()
+        with pytest.raises(KeyError):
+            cache.release(9999)
+
+    def test_fwd_bwd_tickets_are_distinct(self):
+        """fwd=True and fwd=False reserves always receive distinct ticket IDs."""
+        cache = GTPWeightCache()
+        p = self._param()
+        t_fwd = cache.reserve(p, torch.bfloat16, fwd=True)
+        t_bwd = cache.reserve(p, torch.bfloat16, fwd=False)
+        assert t_fwd != t_bwd
+
+
+# ---------------------------------------------------------------------------
+# 3. GTP weight sharding: shard content and alignment padding
+# ---------------------------------------------------------------------------
+
+
+def _worker_sharding_aligned(rank, world_size, port):
+    K, M = world_size * 32, 16  # K divisible by 16*world_size → no padding
+    full_weight = torch.arange(K * M, dtype=torch.float32).reshape(K, M).cuda()
+    dist.broadcast(full_weight, src=0)
+
+    gtp_group = dist.new_group(list(range(world_size)))
+    mod = nn.Module()
+    mod.weight = nn.Parameter(full_weight.clone(), requires_grad=False)
+    wrap_module_params_gtp(mod, ["weight"], gtp_group)
+    shard = mod.weight
+
+    rows_per_rank = K // world_size
+    assert shard.shape == (rows_per_rank, M), f"rank {rank}: unexpected shape {shard.shape}"
+    assert shard.pad_length == 0
+    expected = full_weight[rank * rows_per_rank : (rank + 1) * rows_per_rank]
+    assert torch.allclose(shard.data, expected), f"rank {rank}: shard content mismatch"
+
+
+def _worker_sharding_padding(rank, world_size, port):
+    alignment = 16 * world_size
+    K = alignment - 1  # deliberately unaligned
+    M = 16
+    full_weight = torch.ones(K, M, dtype=torch.float32).cuda()
+    dist.broadcast(full_weight, src=0)
+
+    gtp_group = dist.new_group(list(range(world_size)))
+    mod = nn.Module()
+    mod.weight = nn.Parameter(full_weight.clone(), requires_grad=False)
+    wrap_module_params_gtp(mod, ["weight"], gtp_group)
+    shard = mod.weight
+
+    padded_K = alignment
+    rows_per_rank = padded_K // world_size
+
+    if rank == world_size - 1:
+        assert shard.pad_length > 0
+        # The shard tensor holds only the real rows; get_padded_shard() appends zero rows.
+        padded = shard.get_padded_shard()
+        assert (
+            padded.shape[0] == rows_per_rank
+        ), f"rank {rank}: expected padded shard {rows_per_rank} rows, got {padded.shape[0]}"
+        n_real = K - rank * rows_per_rank
+        assert torch.all(padded[n_real:] == 0), "Padding rows must be zero"
+    else:
+        # pad_length is set globally on every rank's shard (slicer attaches the
+        # global padding amount), so we don't assert anything about it here —
+        # only the last rank's shard contains the actual padding rows.
+        assert (
+            shard.shape[0] == rows_per_rank
+        ), f"rank {rank}: expected {rows_per_rank} rows, got {shard.shape[0]}"
+
+
+class TestGTPSharding:
+    def test_aligned_shard_content(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_sharding_aligned, 4)
+
+    def test_unaligned_shard_padding(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_sharding_padding, 4)
+
+
+# ---------------------------------------------------------------------------
+# 4. wrap_module_params_gtp: param replacement and GroupedLinear weight_list
+# ---------------------------------------------------------------------------
+
+
+def _worker_linear_param_replaced(rank, world_size, port):
+    in_f, out_f = 64, 128
+    gtp_group = dist.new_group(list(range(world_size)))
+    layer = _make_gtp_linear(in_f, out_f, gtp_group)
+    w = layer.weight
+    assert isinstance(w, GTPShardedParam), "weight must be GTPShardedParam"
+    assert w.shape == (out_f // world_size, in_f), f"unexpected shard shape {w.shape}"
+    assert w.group is gtp_group
+
+
+def _worker_grouped_weight_list(rank, world_size, port):
+    num_gemms, in_f, out_f = 3, 32, 64
+    gtp_group = dist.new_group(list(range(world_size)))
+    layer = _make_gtp_grouped_linear(num_gemms, in_f, out_f, gtp_group)
+    w0 = layer.weight0
+    assert isinstance(w0, GTPShardedParam)
+    assert w0.weight_list is not None
+    assert len(w0.weight_list) == num_gemms
+    assert [w.expert_idx for w in w0.weight_list] == list(range(num_gemms))
+
+
+class TestWrapModuleParams:
+    def test_linear_weight_replaced(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_linear_param_replaced, 4)
+
+    def test_grouped_linear_weight_list(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_grouped_weight_list, 4)
+
+
+# ---------------------------------------------------------------------------
+# 5. Linear forward/backward numerical correctness
+# ---------------------------------------------------------------------------
+
+
+def _worker_linear_correctness(rank, world_size, port):
+    """GTP output == (all-gathered weight) @ input, and dX matches."""
+    torch.manual_seed(0)
+    batch, in_f, out_f = 16, 64, 128  # out_f % (16*world_size)==0 → no padding
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+
+    # Reconstruct full weight from shards (all-gather)
+    shard = layer.weight.data.clone()
+    all_shards = [torch.zeros_like(shard) for _ in range(world_size)]
+    dist.all_gather(all_shards, shard, group=gtp_group)
+    full_weight = torch.cat(all_shards, dim=0).float()[:out_f]  # strip any padding
+
+    # Shared input across ranks
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(inp, src=0)
+
+    inp_gtp = inp.clone().requires_grad_(True)
+    inp_ref = inp.clone().requires_grad_(True)
+
+    # GTP forward
+    out_gtp = layer(inp_gtp, is_first_microbatch=True)
+
+    # Reference forward
+    out_ref = inp_ref.float() @ full_weight.T
+    out_ref = out_ref.to(dtype)
+
+    assert out_gtp.shape == out_ref.shape, f"Shape mismatch {out_gtp.shape} vs {out_ref.shape}"
+    assert torch.allclose(
+        out_gtp.float(), out_ref.float(), atol=1e-5, rtol=1e-5
+    ), f"Output mismatch max_diff={(out_gtp.float()-out_ref.float()).abs().max():.4f}"
+
+    # wgrad RS path always accumulates into main_grad; allocate before backward.
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+
+    # Backward: compare input gradient
+    grad_out = torch.randn_like(out_gtp)
+    dist.broadcast(grad_out, src=0)
+    out_gtp.backward(grad_out)
+    out_ref.backward(grad_out.float())
+
+    assert inp_gtp.grad is not None
+    assert torch.allclose(
+        inp_gtp.grad.float(), inp_ref.grad.float(), atol=1e-5, rtol=1e-5
+    ), f"dX mismatch max_diff={(inp_gtp.grad.float()-inp_ref.grad.float()).abs().max():.4f}"
+
+
+class TestLinearGTP:
+    def test_forward_backward_correctness(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_linear_correctness, 4)
+
+
+# ---------------------------------------------------------------------------
+# 6. LayerNormLinear forward/backward smoke test
+# ---------------------------------------------------------------------------
+
+
+def _worker_layernorm_linear(rank, world_size, port):
+    torch.manual_seed(0)
+    seq, batch, in_f, out_f = 4, 2, 64, 128
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = te.LayerNormLinear(
+        in_features=in_f,
+        out_features=out_f,
+        bias=False,
+        params_dtype=dtype,
+        device="cuda",
+        gtp_group=gtp_group,
+    )
+    assert isinstance(layer.weight, GTPShardedParam)
+
+    inp = torch.randn(seq, batch, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    out = layer(inp, is_first_microbatch=True)
+    assert out.shape == (seq, batch, out_f), f"unexpected output shape {out.shape}"
+
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+    out.sum().backward()
+    assert inp.grad is not None and inp.grad.shape == inp.shape
+
+
+class TestLayerNormLinearGTP:
+    def test_forward_backward(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_layernorm_linear, 4)
+
+
+# ---------------------------------------------------------------------------
+# 7. GroupedLinear forward/backward smoke test
+# ---------------------------------------------------------------------------
+
+
+def _worker_grouped_linear(rank, world_size, port, num_gemms):
+    torch.manual_seed(0)
+    in_f, out_f, total_tokens = 32, 64, num_gemms * 4
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_grouped_linear(num_gemms, in_f, out_f, gtp_group, dtype)
+    assert isinstance(layer.weight0, GTPShardedParam)
+
+    m_splits = [total_tokens // num_gemms] * num_gemms
+    m_splits[-1] += total_tokens - sum(m_splits)
+
+    inp = torch.randn(total_tokens, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    out = layer(inp, m_splits=m_splits, is_first_microbatch=True)
+    assert out.shape == (total_tokens, out_f), f"unexpected output shape {out.shape}"
+
+    for i in range(num_gemms):
+        w = getattr(layer, f"weight{i}")
+        w.main_grad = torch.zeros(w.shape, dtype=dtype, device="cuda")
+    out.sum().backward()
+    assert inp.grad is not None and inp.grad.shape == inp.shape
+
+
+class TestGroupedLinearGTP:
+    @pytest.mark.parametrize("num_gemms", [2, 4])
+    def test_forward_backward(self, num_gemms):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_grouped_linear, 4, num_gemms)
+
+
+# ---------------------------------------------------------------------------
+# 8. Prefetch chain: next_w / prev_w wiring after first forward pass
+# ---------------------------------------------------------------------------
+
+
+def _worker_chain_wired(rank, world_size, port):
+    torch.manual_seed(0)
+    in_f, out_f = 32, 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    l0 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    l1 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+
+    inp = torch.randn(4, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(inp, src=0)
+
+    # First forward pass builds the linked list
+    l0(inp, is_first_microbatch=True)
+    l1(inp, is_first_microbatch=True)
+
+    w0, w1 = l0.weight, l1.weight
+    assert w0.next_w is w1, "w0.next_w should point to w1"
+    assert w1.prev_w is w0, "w1.prev_w should point back to w0"
+    assert w1.next_w is None
+    assert w0.prev_w is None
+
+
+def _worker_chain_async_prefetch(rank, world_size, port):
+    """On the second forward pass, w1 should be in DATA_READY before its forward runs."""
+    torch.manual_seed(0)
+    in_f, out_f = 32, 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    l0 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    l1 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+
+    inp = torch.randn(4, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(inp, src=0)
+
+    # First pass builds chain, second pass uses async prefetch
+    for _ in range(2):
+        out = l0(inp, is_first_microbatch=True) + l1(inp, is_first_microbatch=True)
+    assert torch.isfinite(out).all(), "Non-finite output on second pass"
+
+
+class TestGTPPrefetchChain:
+    def test_chain_wired_after_first_pass(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_chain_wired, 4)
+
+    def test_async_prefetch_second_pass(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_chain_async_prefetch, 4)
+
+
+# ---------------------------------------------------------------------------
+# 9. Wgrad reduce-scatter: shape and deferred async path
+# ---------------------------------------------------------------------------
+
+
+def _worker_wgrad_shape(rank, world_size, port):
+    """After backward, weight.grad shape must match the local shard shape."""
+    torch.manual_seed(0)
+    in_f, out_f = 32, 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype, fuse_wgrad_accumulation=False)
+    inp = torch.randn(8, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+    layer(inp, is_first_microbatch=True).sum().backward()
+
+    w = layer.weight
+    if w.grad is not None:
+        assert w.grad.shape == w.shape, f"wgrad shape {w.grad.shape} != shard shape {w.shape}"
+
+
+def _worker_multilayer_deferred_rs(rank, world_size, port):
+    """Two-layer GTP: async RS deferred for layer0 (non-last), sync for layer1 (last in bwd)."""
+    torch.manual_seed(0)
+    in_f, out_f = 32, 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    l0 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    l1 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+
+    inp = torch.randn(8, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    # wgrad RS path always accumulates into main_grad; allocate before backward.
+    l0.weight.main_grad = torch.zeros(l0.weight.shape, dtype=dtype, device="cuda")
+    l1.weight.main_grad = torch.zeros(l1.weight.shape, dtype=dtype, device="cuda")
+
+    out = l0(inp, is_first_microbatch=True) + l1(inp, is_first_microbatch=True)
+    out.sum().backward()
+
+    # Both weights' main_grad should have been updated
+    for lyr in [l0, l1]:
+        w = lyr.weight
+        assert w.main_grad is not None, f"No main_grad on {lyr.__class__.__name__}.weight"
+
+
+class TestGTPWgradRS:
+    def test_wgrad_shape_matches_shard(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_wgrad_shape, 4)
+
+    def test_multilayer_deferred_rs(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_multilayer_deferred_rs, 4)
+
+
+# ---------------------------------------------------------------------------
+# 10. Multiple microbatches: output must be consistent when weight unchanged
+# ---------------------------------------------------------------------------
+
+
+def _worker_microbatches(rank, world_size, port):
+    torch.manual_seed(0)
+    batch, in_f, out_f = 8, 64, 128
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(inp, src=0)
+
+    # First microbatch
+    out1 = layer(inp, is_first_microbatch=True).detach().clone()
+
+    # Second microbatch with same weight (skip_weight_cast=True path)
+    out2 = layer(inp, is_first_microbatch=False).detach()
+
+    assert torch.allclose(
+        out1, out2
+    ), f"Microbatch outputs differ; max_diff={(out1-out2).abs().max():.6f}"
+
+
+class TestGTPMicrobatches:
+    def test_consistent_across_microbatches(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_microbatches, 4)
+
+
+# ---------------------------------------------------------------------------
+# 11. NVFP4 + GTP: Linear forward/backward, quantized shard setup
+# ---------------------------------------------------------------------------
+
+
+def _worker_nvfp4_linear(rank, world_size, port):
+    """Verify that GTP Linear correctly quantizes, all-gathers, and computes with NVFP4."""
+    torch.manual_seed(0)
+    # batch=32: NVFP4 wgrad GEMM (K=batch) requires K divisible by 32
+    batch, in_f, out_f = 32, 64, 128  # out_f % (16*world_size)==0 → no padding
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    # Forward under NVFP4 recipe - triggers setup() and NVFP4 quantization
+    recipe = NVFP4BlockScaling()
+    with fp8_autocast(enabled=True, fp8_recipe=recipe):
+        out = layer(inp, is_first_microbatch=True)
+
+    # After the first forward pass setup() must have created a quantized shard
+    w = layer.weight
+    assert w.quantized is not None, "NVFP4 quantized shard must be set after setup()"
+    assert isinstance(
+        w.quantized, QuantizedTensor
+    ), f"weight.quantized should be QuantizedTensor, got {type(w.quantized)}"
+
+    assert out.shape == (batch, out_f), f"unexpected output shape {out.shape}"
+    assert torch.isfinite(out).all(), "NVFP4 GTP output has non-finite values"
+
+    # Second microbatch reuses cached quantized weight (skip_weight_cast path)
+    with fp8_autocast(enabled=True, fp8_recipe=recipe):
+        out2 = layer(inp.detach(), is_first_microbatch=False)
+    assert torch.isfinite(out2).all(), "NVFP4 GTP second-microbatch output has non-finite values"
+
+
+def _worker_nvfp4_linear_unaligned(rank, world_size, port):
+    """Verify NVFP4 GTP when out_features is not aligned to 16*world_size (padding path).
+
+    out_f is chosen to be divisible by 8 (satisfies NVFP4 GEMM alignment) but not by
+    16*world_size (so padding is needed). The last GTP rank receives a shard that is
+    zero-padded to reach the shard_size boundary. After all-gather, _strip_padding
+    removes the padded rows from the gathered weight before the GEMM, so the output
+    has the original out_f columns.
+    """
+    torch.manual_seed(0)
+    alignment = 16 * world_size  # 64 for world_size=4
+    # Choose out_f divisible by 8 (NVFP4 GEMM constraint) but not by 64 (GTP alignment).
+    # With out_f=56: pad_length=8, shard_size=16, last rank gets 8 rows padded to 16.
+    out_f = alignment - 8  # 56 for world_size=4
+    in_f = 64
+    batch = 32
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    with fp8_autocast(enabled=True, fp8_recipe=NVFP4BlockScaling()):
+        out = layer(inp, is_first_microbatch=True)
+
+    # After _strip_padding removes the padded rows, output has out_f (not padded) cols.
+    assert out.shape == (batch, out_f), f"unexpected output shape {out.shape}"
+    assert torch.isfinite(out).all(), "NVFP4 GTP (unaligned) output has non-finite values"
+
+
+class TestNVFP4LinearGTP:
+    def test_forward_backward(self):
+        _requires_nvfp4()
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_nvfp4_linear, 4)
+
+    def test_forward_unaligned_padding(self):
+        _requires_nvfp4()
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_nvfp4_linear_unaligned, 4)
+
+
+# ---------------------------------------------------------------------------
+# 12. NVFP4 + GTP: GroupedLinear forward/backward (coalesced batched all-gather)
+# ---------------------------------------------------------------------------
+
+
+def _worker_nvfp4_grouped_linear(rank, world_size, port, num_gemms):
+    """Verify NVFP4 GTP with GroupedLinear (uses grouped_gather_along_first_dim)."""
+    torch.manual_seed(0)
+    # NVFP4 split_quantize constraints: in_f % 128 == 0, tokens_per_expert % 64 == 0
+    # (Hadamard transform requirement), and K=tokens_per_expert % 32 == 0 for wgrad.
+    in_f, out_f, total_tokens = 128, 256, num_gemms * 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_grouped_linear(num_gemms, in_f, out_f, gtp_group, dtype)
+    assert isinstance(layer.weight0, GTPShardedParam)
+
+    m_splits = [total_tokens // num_gemms] * num_gemms
+    m_splits[-1] += total_tokens - sum(m_splits)
+
+    inp = torch.randn(total_tokens, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    with fp8_autocast(enabled=True, fp8_recipe=NVFP4BlockScaling()):
+        out = layer(inp, m_splits=m_splits, is_first_microbatch=True)
+
+    assert out.shape == (total_tokens, out_f), f"unexpected output shape {out.shape}"
+    assert torch.isfinite(out).all(), "NVFP4 GroupedLinear GTP output has non-finite values"
+
+    # All expert weight shards should be quantized after setup()
+    for i in range(num_gemms):
+        name = f"weight{i}"
+        w = getattr(layer, name)
+        assert isinstance(w, GTPShardedParam)
+        assert w.quantized is not None, f"{name}.quantized not set after NVFP4 setup()"
+        assert isinstance(
+            w.quantized, QuantizedTensor
+        ), f"{name}.quantized should be QuantizedTensor, got {type(w.quantized)}"
+
+    for i in range(num_gemms):
+        w = getattr(layer, f"weight{i}")
+        w.main_grad = torch.zeros(w.shape, dtype=dtype, device="cuda")
+    out.sum().backward()
+    assert inp.grad is not None and inp.grad.shape == inp.shape
+
+
+class TestNVFP4GroupedLinearGTP:
+    @pytest.mark.parametrize("num_gemms", [2, 4])
+    def test_forward_backward(self, num_gemms):
+        _requires_nvfp4()
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_nvfp4_grouped_linear, 4, num_gemms)
+
+
+# ---------------------------------------------------------------------------
+# 13. MXFP8 + GTP: Linear forward/backward, quantized shard setup
+# ---------------------------------------------------------------------------
+
+
+def _worker_mxfp8_linear(rank, world_size, port):
+    """Verify that GTP Linear correctly quantizes, all-gathers, and computes with MXFP8."""
+    from transformer_engine.common.recipe import MXFP8BlockScaling
+
+    torch.manual_seed(0)
+    # batch=32: MXFP8 wgrad GEMM (K=batch) requires K divisible by MXFP8_BLOCK_SCALING_SIZE=32
+    batch, in_f, out_f = 32, 64, 128  # out_f % (16*world_size)==0 → no padding
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    # Forward under MXFP8 recipe - triggers setup() and MXFP8 quantization
+    recipe = MXFP8BlockScaling()
+    with fp8_autocast(enabled=True, fp8_recipe=recipe):
+        out = layer(inp, is_first_microbatch=True)
+
+    # After the first forward pass setup() must have created a quantized shard
+    w = layer.weight
+    assert w.quantized is not None, "MXFP8 quantized shard must be set after setup()"
+    assert isinstance(
+        w.quantized, QuantizedTensor
+    ), f"weight.quantized should be QuantizedTensor, got {type(w.quantized)}"
+
+    assert out.shape == (batch, out_f), f"unexpected output shape {out.shape}"
+    assert torch.isfinite(out).all(), "MXFP8 GTP output has non-finite values"
+
+    # Backward should complete without error
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+    out.sum().backward()
+    assert inp.grad is not None
+    assert inp.grad.shape == inp.shape
+
+    # Second microbatch reuses cached quantized weight (skip_weight_cast path)
+    with fp8_autocast(enabled=True, fp8_recipe=recipe):
+        out2 = layer(inp.detach(), is_first_microbatch=False)
+    assert torch.isfinite(out2).all(), "MXFP8 GTP second-microbatch output has non-finite values"
+
+
+def _worker_mxfp8_linear_unaligned(rank, world_size, port):
+    """Verify MXFP8 GTP when out_features is not aligned to 16*world_size (padding path).
+
+    MXFP8 requires tensor dims divisible by 32, so shard_size (= M_padded / world_size)
+    must be a multiple of 32. With world_size=4 this requires M_padded % 128 == 0.
+    out_f=120 gives M_padded=128, shard_size=32 (32 % 32 == 0). The last rank has
+    24 real rows zero-padded to 32. After all-gather, _strip_padding removes the padded
+    rows before the GEMM, so the output has the original out_f columns.
+    """
+    from transformer_engine.common.recipe import MXFP8BlockScaling
+
+    torch.manual_seed(0)
+    # out_f=120: M_padded=128, shard_size=32, last rank has 24 rows padded to 32.
+    # 120 is divisible by 8 (GEMM constraint), not by 64 (GTP alignment → padding needed).
+    out_f = 120
+    in_f = 64
+    batch = 32
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    with fp8_autocast(enabled=True, fp8_recipe=MXFP8BlockScaling()):
+        out = layer(inp, is_first_microbatch=True)
+
+    # After _strip_padding removes the padded rows, output has out_f (not padded) cols.
+    assert out.shape == (batch, out_f), f"unexpected output shape {out.shape}"
+    assert torch.isfinite(out).all(), "MXFP8 GTP (unaligned) output has non-finite values"
+
+
+class TestMXFP8LinearGTP:
+    def test_forward_backward(self):
+        _requires_mxfp8()
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_mxfp8_linear, 4)
+
+    def test_forward_unaligned_padding(self):
+        _requires_mxfp8()
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_mxfp8_linear_unaligned, 4)
+
+
+# ---------------------------------------------------------------------------
+# 14. GTPConfig / update_gtp_config
+# ---------------------------------------------------------------------------
+
+
+class TestGTPConfig:
+
+    def test_update_pad_for_alignment(self):
+        original = gtp_module.GTP_CONFIG.pad_for_alignment
+        try:
+            gtp_module.update_gtp_config(pad_for_alignment=8)
+            assert gtp_module.GTP_CONFIG.pad_for_alignment == 8
+        finally:
+            gtp_module.update_gtp_config(pad_for_alignment=original)
+
+    def test_update_weight_prefetch(self):
+        original = gtp_module.GTP_CONFIG.weight_prefetch
+        try:
+            gtp_module.update_gtp_config(weight_prefetch=False)
+            assert gtp_module.GTP_CONFIG.weight_prefetch is False
+        finally:
+            gtp_module.update_gtp_config(weight_prefetch=original)
+
+    def test_invalid_key_raises(self):
+        with pytest.raises(ValueError, match="Unknown GTP config option"):
+            gtp_module.update_gtp_config(nonexistent_key=123)
+
+
+# ---------------------------------------------------------------------------
+# 15. GTPShardedParam properties - shape computations and padding
+# ---------------------------------------------------------------------------
+
+
+class TestGTPShardedParamProperties:
+
+    def _make_param(self, shape, pad_length=0, group_size=4, group_rank=0):
+        p = GTPShardedParam(torch.zeros(*shape))
+        p.group = _FakeGroup(size=group_size, rank=group_rank)
+        p.pad_length = pad_length
+        p.expert_idx = None
+        return p
+
+    # --- _unsharded_shape_padded ---
+
+    def test_unsharded_shape_padded_no_padding(self):
+        # shape=(8, 4), group_size=4 → 8*4=32 rows, no padding
+        p = self._make_param((8, 4), pad_length=0, group_size=4, group_rank=2)
+        assert p._unsharded_shape_padded == (32, 4)
+
+    def test_unsharded_shape_padded_last_rank_with_padding(self):
+        # Local shard includes its slice of padding rows: 16 rows per rank,
+        # pad_length=1 marks 1 of those (on the last rank) as pad → padded
+        # unsharded shape = 16 * 4 = 64.  pad_length is global metadata, the
+        # same value lives on every rank's shard.
+        p = self._make_param((16, 32), pad_length=1, group_size=4, group_rank=3)
+        assert p._unsharded_shape_padded == (64, 32)
+
+    def test_unsharded_shape_padded_non_last_rank_with_padding(self):
+        # Non-last rank: pad_length is the same global value, same formula.
+        p = self._make_param((16, 32), pad_length=1, group_size=4, group_rank=0)
+        assert p._unsharded_shape_padded == (64, 32)
+
+    # --- _unsharded_shape ---
+
+    def test_unsharded_shape_no_padding(self):
+        p = self._make_param((8, 4), pad_length=0, group_size=4, group_rank=0)
+        assert p._unsharded_shape == (32, 4)
+
+    def test_unsharded_shape_strips_padding(self):
+        # Local 16 rows × 4 ranks = 64 padded; pad_length=1 → unsharded = 63.
+        p = self._make_param((16, 32), pad_length=1, group_size=4, group_rank=3)
+        assert p._unsharded_shape == (63, 32)
+
+    # --- get_padded_shard ---
+
+    def test_get_padded_shard_identity_when_no_padding(self):
+        p = self._make_param((6, 4), pad_length=0)
+        result = p.get_padded_shard()
+        assert result is p  # identity - no copy needed
+
+    def test_get_padded_shard_identity_non_last_rank(self):
+        # pad_length > 0 but not the padded last rank → no padding added
+        p = self._make_param((16, 4), pad_length=1, group_size=4, group_rank=0)
+        result = p.get_padded_shard()
+        assert result is p
+
+    def test_get_padded_shard_identity_last_rank(self):
+        # Under current semantics the local shard already contains its share
+        # of padding (slicer F.pads with zeros before slicing), so
+        # get_padded_shard() is the identity on the last rank too.
+        p = self._make_param((8, 4), pad_length=2, group_size=4, group_rank=3)
+        assert p.get_padded_shard() is p
+
+    # --- _strip_padding ---
+
+    def test_strip_padding_identity_no_padding(self):
+        p = self._make_param((8, 4), pad_length=0)
+        t = torch.randn(32, 4)
+        assert p._strip_padding(t) is t
+
+    def test_strip_padding_plain_tensor(self):
+        # Gathered weight [32, 4] with pad_length=1 → strip 1 row → [31, 4]
+        p = self._make_param((7, 4), pad_length=1, group_size=4, group_rank=0)
+        t = torch.randn(32, 4)
+        result = p._strip_padding(t)
+        assert result.shape == (31, 4)
+        assert torch.equal(result, t[:-1])
+
+    def test_strip_padding_multi_row(self):
+        # pad_length=4 strips 4 rows
+        p = self._make_param((12, 8), pad_length=4, group_size=4, group_rank=0)
+        t = torch.ones(64, 8)
+        result = p._strip_padding(t)
+        assert result.shape == (60, 8)
+
+
+# ---------------------------------------------------------------------------
+# 16. _get_cache_key - expert vs non-expert, fwd vs bwd
+# ---------------------------------------------------------------------------
+
+
+class TestGTPCacheKey:
+
+    def _param(self, shape=(16, 32), expert_idx=None):
+        p = GTPShardedParam(torch.zeros(*shape))
+        p.group = _FakeGroup(size=4)
+        p.expert_idx = expert_idx
+        p.pad_length = 0
+        return p
+
+    def test_non_expert_key_same_for_fwd_bwd(self):
+        """Non-routed params produce the same cache key for fwd and bwd."""
+        p = self._param(expert_idx=None)
+        assert p._get_cache_key(torch.bfloat16, fwd=True, reduce_scatter=False) == p._get_cache_key(
+            torch.bfloat16, fwd=False, reduce_scatter=False
+        )
+
+    def test_expert_key_differs_fwd_bwd(self):
+        """For quantized (non-torch.dtype) recipes, expert fwd vs bwd keys differ."""
+        p = self._param(expert_idx=0)
+        # _get_cache_key differentiates fwd/bwd only for non-torch.dtype objects
+        # (e.g. quantized recipe dtype descriptors).  Use a mock to trigger that path.
+        mock_dtype = "fp8"
+        assert p._get_cache_key(mock_dtype, fwd=True, reduce_scatter=False) != p._get_cache_key(
+            mock_dtype, fwd=False, reduce_scatter=False
+        )
+
+    def test_different_expert_idx_different_keys(self):
+        """Two experts with same shape but different indices get distinct keys."""
+        p0 = self._param(expert_idx=0)
+        p1 = self._param(expert_idx=1)
+        assert p0._get_cache_key(
+            torch.bfloat16, fwd=True, reduce_scatter=False
+        ) != p1._get_cache_key(torch.bfloat16, fwd=True, reduce_scatter=False)
+
+    def test_same_expert_idx_same_key(self):
+        """Same-shaped experts with the same idx share a cache key (cross-layer buffer reuse)."""
+        p_l0 = self._param(expert_idx=0)
+        p_l1 = self._param(expert_idx=0)
+        assert p_l0._get_cache_key(
+            torch.bfloat16, fwd=True, reduce_scatter=False
+        ) == p_l1._get_cache_key(torch.bfloat16, fwd=True, reduce_scatter=False)
+
+    def test_different_dtypes_different_keys(self):
+        p = self._param()
+        assert p._get_cache_key(torch.bfloat16, fwd=True, reduce_scatter=False) != p._get_cache_key(
+            torch.float32, fwd=True, reduce_scatter=False
+        )
+
+    def test_rs_key_differs_from_ag_key(self):
+        """reduce_scatter=True key must differ from reduce_scatter=False key."""
+        p = self._param()
+        assert p._get_cache_key(torch.bfloat16, fwd=True, reduce_scatter=False) != p._get_cache_key(
+            torch.bfloat16, fwd=True, reduce_scatter=True
+        )
+
+
+# ---------------------------------------------------------------------------
+# 17. tag_gtp_params_with_names - _debug_name population
+# ---------------------------------------------------------------------------
+
+
+class TestTagGTPParamsWithNames:
+
+    def test_debug_name_populated_for_gtp_param(self):
+        """GTPShardedParam._debug_name is set to the dotted parameter path."""
+        model = nn.Linear(4, 8, bias=False)
+        w = GTPShardedParam(torch.randn(8, 4))
+        w.group = _FakeGroup()
+        model._parameters["weight"] = w
+
+        gtp_module.tag_gtp_params_with_names(model)
+        assert w._debug_name == "weight", f"Expected 'weight', got '{w._debug_name}'"
+
+    def test_nested_module_debug_name(self):
+        """Nested module produces a dotted debug name."""
+        outer = nn.Sequential(nn.Linear(4, 8, bias=False))
+        w = GTPShardedParam(torch.randn(8, 4))
+        w.group = _FakeGroup()
+        outer._modules["0"]._parameters["weight"] = w
+
+        gtp_module.tag_gtp_params_with_names(outer)
+        assert w._debug_name == "0.weight", f"Expected '0.weight', got '{w._debug_name}'"
+
+    def test_non_gtp_params_are_skipped(self):
+        """Plain nn.Parameter instances are silently ignored."""
+        model = nn.Linear(4, 8)
+        gtp_module.tag_gtp_params_with_names(model)  # must not raise
+
+
+# ---------------------------------------------------------------------------
+# 18. wrap_module_params_gtp is a no-op when gtp_group.size() == 1
+# ---------------------------------------------------------------------------
+
+
+class TestGTPGroupSizeOne:
+
+    def test_no_sharding_when_gtp_size_one(self):
+        """wrap_module_params_gtp must be a no-op for a singleton GTP group."""
+        mod = nn.Linear(32, 64, bias=False)
+        original_weight = mod.weight
+        wrap_module_params_gtp(mod, ["weight"], _FakeGroup())
+        assert (
+            mod.weight is original_weight
+        ), "gtp_group.size()==1 should leave parameters unchanged"
+        assert not isinstance(mod.weight, GTPShardedParam)
+
+
+# ---------------------------------------------------------------------------
+# 19. weight_prefetch=False: forward still produces correct output
+# ---------------------------------------------------------------------------
+
+
+def _worker_prefetch_disabled(rank, world_size, port):
+    torch.manual_seed(0)
+    in_f, out_f = 32, 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    gtp_module.update_gtp_config(weight_prefetch=False)
+    try:
+        l0 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+        l1 = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+
+        inp = torch.randn(4, in_f, dtype=dtype, device="cuda")
+        dist.broadcast(inp, src=0)
+
+        # Single forward pass: builds chain and verifies output is correct
+        out = l0(inp, is_first_microbatch=True) + l1(inp, is_first_microbatch=True)
+
+        # Chain should still be wired even with prefetch disabled
+        assert l0.weight.next_w is l1.weight
+        assert torch.isfinite(out).all(), "Non-finite output with prefetch disabled"
+    finally:
+        gtp_module.update_gtp_config(weight_prefetch=True)
+
+
+class TestGTPPrefetchDisabled:
+    def test_forward_works_without_prefetch(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_prefetch_disabled, 4)
+
+
+# ---------------------------------------------------------------------------
+# 20. fuse_wgrad_accumulation=True: wgrad is accumulated into main_grad
+# ---------------------------------------------------------------------------
+
+
+def _worker_fuse_wgrad(rank, world_size, port):
+    torch.manual_seed(0)
+    in_f, out_f = 32, 128  # out_f % (16*world_size)==0, no padding
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype, fuse_wgrad_accumulation=True)
+
+    # Allocate main_grad on the local shard shape
+    w = layer.weight
+    w.main_grad = torch.zeros(w.shape, dtype=dtype, device="cuda")
+
+    inp = torch.randn(8, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    layer(inp, is_first_microbatch=True).sum().backward()
+
+    # With fused accumulation, wgrad was added into main_grad
+    assert torch.any(
+        w.main_grad != 0
+    ), "main_grad should have been updated by fused wgrad accumulation"
+
+
+class TestFuseWgradAccumulation:
+    def test_wgrad_accumulated_into_main_grad(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_fuse_wgrad, 4)
+
+
+# ---------------------------------------------------------------------------
+# 21. _grad_accum_hook is called after reduce-scatter
+# ---------------------------------------------------------------------------
+
+
+def _worker_main_grad_updated_after_bwd(rank, world_size, port):
+    """After backward, the wgrad RS path must have accumulated wgrad into main_grad."""
+    torch.manual_seed(0)
+    in_f, out_f = 32, 64
+    dtype = torch.bfloat16
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype)
+
+    # wgrad RS path always accumulates into main_grad; allocate before backward.
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+
+    inp = torch.randn(8, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+    layer(inp, is_first_microbatch=True).sum().backward()
+
+    assert torch.any(
+        layer.weight.main_grad != 0
+    ), "main_grad should have been updated after the reduce-scatter accumulation"
+
+
+class TestGTPGradAccumHook:
+    def test_main_grad_updated_after_backward(self):
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_main_grad_updated_after_bwd, 4)
+
+
+# ---------------------------------------------------------------------------
+# 22. wait_async_comms(finalize_after_drain=True) inline-accumulation fallback
+# ---------------------------------------------------------------------------
+
+
+class TestWaitAsyncCommsFallback:
+    """Exercises the inline-accumulation fallback inside
+    ``wait_async_comms(finalize_after_drain=True)``: when a param is in
+    ``_inflight_comm_params`` (async AG was issued) but its ``_wgrad_rs_handle``
+    is ``None`` (no async RS handle to drain), the inner
+    ``_wait_reduce_scatter`` call no-ops and the outer loop must inline the
+    accumulation itself (main_grad.add_ + ticket release + flag set).
+
+    Production flows rarely hit this combination — chain-interior params have
+    both async AG and async RS, and chain-head sync RS doesn't enter
+    ``_inflight_comm_params`` via bwd AG. We construct the state by hand to
+    pin down the fallback's contract.
+    """
+
+    @staticmethod
+    def _make_inflight_param(main_grad_fill=0.0, already_finalized=False):
+        """Build a minimal GTPShardedParam wired for wait_async_comms testing."""
+        dtype = torch.bfloat16
+        p = GTPShardedParam(torch.zeros(8, 4, dtype=dtype, device="cuda"))
+        p.group = _FakeGroup()
+        p.expert_idx = None
+        p.pad_length = 0
+        p.chain_id = gtp_module.GTPChain.UNGRAPHED.value
+        p._quantizer = None
+        p.is_routed_expert = False  # ⇒ self._weights property returns [self]
+        p.main_grad = torch.full((8, 4), main_grad_fill, dtype=dtype, device="cuda")
+        p._prefetch_handle = None  # _wait_param_gather is no-op
+        p._wgrad_rs_handle = None  # _wait_reduce_scatter is no-op → fallback fires
+        p._cached_ag_stream = None
+        p._cached_rs_stream = None
+        p.ag_event = torch.cuda.Event(external=True)
+        p.rs_event = torch.cuda.Event(external=True)
+        p.rs_event.record()  # so rs_event.wait() in fallback doesn't block
+        p._already_finalized = already_finalized
+        p.grad_added_to_main_grad = False
+        return p
+
+    @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required")
+    def test_fallback_accumulates_when_no_rs_handle(self):
+        dtype = torch.bfloat16
+        p = self._make_inflight_param(main_grad_fill=0.0)
+
+        # Place a known wgrad in the cache for the fallback to read.
+        cache = gtp_module.get_global_GTP_cache()
+        p._rs_ticket = cache.reserve(p, dtype, fwd=False, reduce_scatter=True)
+        cache.get(p._rs_ticket).fill_(2.0)
+
+        # Save + replace _inflight_comm_params so we don't trip over leftover
+        # params from earlier tests in the loop.
+        saved = set(gtp_module._inflight_comm_params)
+        gtp_module._inflight_comm_params.clear()
+        gtp_module._inflight_comm_params.add(p)
+        try:
+            gtp_module.wait_async_comms(
+                chain_id=p.chain_id, skip_rs=False, finalize_after_drain=True
+            )
+        finally:
+            gtp_module._inflight_comm_params.clear()
+            gtp_module._inflight_comm_params.update(saved)
+
+        torch.cuda.synchronize()
+        assert torch.all(
+            p.main_grad == 2.0
+        ), f"main_grad should be 2.0 after fallback accumulation; got {p.main_grad}"
+        assert p._already_finalized is True, "_already_finalized must be set"
+        assert p.grad_added_to_main_grad is True, "grad_added_to_main_grad must be set"
+
+    @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required")
+    def test_fallback_skipped_when_already_finalized(self):
+        """When _already_finalized=True, the fallback must NOT re-accumulate."""
+        p = self._make_inflight_param(main_grad_fill=5.0, already_finalized=True)
+        # No _rs_ticket: if the fallback ran it would AttributeError on cache.get(None).
+        p._rs_ticket = None
+
+        saved = set(gtp_module._inflight_comm_params)
+        gtp_module._inflight_comm_params.clear()
+        gtp_module._inflight_comm_params.add(p)
+        try:
+            gtp_module.wait_async_comms(
+                chain_id=p.chain_id, skip_rs=False, finalize_after_drain=True
+            )
+        finally:
+            gtp_module._inflight_comm_params.clear()
+            gtp_module._inflight_comm_params.update(saved)
+
+        torch.cuda.synchronize()
+        assert torch.all(
+            p.main_grad == 5.0
+        ), "main_grad must be untouched when _already_finalized=True"
+
+    @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required")
+    def test_fallback_skipped_for_pure_ag_param(self):
+        """Regression: cross-graph fwd-AG prefetch in flight + finalize_after_drain=True.
+
+        A param can be in _inflight_comm_params because of an outstanding async
+        all-gather (e.g. a cross-graph forward prefetch reaching the
+        bwd→optimizer boundary).  No reduce-scatter was ever issued for that
+        param, so _rs_ticket is None on every weight.  Previously the fallback
+        called cache.get(None) and crashed with KeyError; the guard now skips
+        the inline accumulation entirely when no weight has an RS ticket.
+        """
+        p = self._make_inflight_param(main_grad_fill=7.0)
+        # Critical: simulates a pure-AG prefetch — no RS ever issued, ticket is None.
+        p._rs_ticket = None
+
+        saved = set(gtp_module._inflight_comm_params)
+        gtp_module._inflight_comm_params.clear()
+        gtp_module._inflight_comm_params.add(p)
+        try:
+            # Must NOT raise KeyError(None) from cache.get(None).
+            gtp_module.wait_async_comms(
+                chain_id=p.chain_id, skip_rs=False, finalize_after_drain=True
+            )
+        finally:
+            gtp_module._inflight_comm_params.clear()
+            gtp_module._inflight_comm_params.update(saved)
+
+        torch.cuda.synchronize()
+        assert torch.all(
+            p.main_grad == 7.0
+        ), "main_grad must be untouched for a pure-AG param (no wgrad to accumulate)"
+        assert (
+            p._already_finalized is False
+        ), "_already_finalized must stay False — no finalize happened for a pure-AG param"
+
+
+# ---------------------------------------------------------------------------
+# 23. GTP DDP bucket alignment: distributed optimizer bucket-end assertion
+# ---------------------------------------------------------------------------
+
+
+def _worker_gtp_ddp_bucket_alignment(rank, world_size, port):
+    """GTP param buffers in DDP must use padded bucket layout with use_distributed_optimizer=True.
+
+    Bug: DDP used param_layout=None for GTP buffers, falling through to
+    _compute_default_per_buffer_param_layout, which packs params without padding bucket ends.
+    The distributed optimizer requires every bucket end to be divisible by
+    intra_dp_cp_no_gtp_group.size() (asserted at param_and_grad_buffer.py:1427).
+
+    Trigger:
+      GTP=2, DP=4  →  intra_dp_cp_no_gtp_group.size()=2
+      pad_for_alignment=0, weight [out=2,in=3]  →  GTP shard=[1,3]=3 elements (odd)
+      Two GTP params: total=6, 6%2==0 (total check passes); bucket_size=3 forces
+      bucket-0 to contain only the first param, end=3, 3%2≠0  →  AssertionError
+    """
+    from megatron.core import parallel_state as ps
+    from megatron.core.distributed import DistributedDataParallel, DistributedDataParallelConfig
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    # The module fixture initialized model_parallel without GTP; re-init with GTP=2.
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+
+    orig_pad = gtp_module.GTP_CONFIG.pad_for_alignment
+    gtp_module.GTP_CONFIG.pad_for_alignment = 0
+    try:
+        gtp_group = ps.get_gtp_weight_remat_group()
+
+        class _TwoLayerModel(torch.nn.Module):
+            def __init__(self):
+                super().__init__()
+                self.fc0 = te.Linear(3, 2, bias=False, device="cuda")
+                self.fc1 = te.Linear(3, 2, bias=False, device="cuda")
+
+        model = _TwoLayerModel()
+        wrap_module_params_gtp(model.fc0, ["weight"], gtp_group)
+        wrap_module_params_gtp(model.fc1, ["weight"], gtp_group)
+
+        config = TransformerConfig(
+            num_attention_heads=1, num_layers=1, hidden_size=4, tensor_model_parallel_size=1
+        )
+        ddp_config = DistributedDataParallelConfig(
+            use_distributed_optimizer=True, overlap_grad_reduce=True, bucket_size=3
+        )
+
+        # Without the fix this raises AssertionError at param_and_grad_buffer.py:1427:
+        #   assert end_index % self.data_parallel_world_size == 0
+        DistributedDataParallel(config, ddp_config, model)
+    finally:
+        gtp_module.GTP_CONFIG.pad_for_alignment = orig_pad
+        ps.destroy_model_parallel()
+        ps.initialize_model_parallel()  # restore default for remaining tests
+
+
+def _worker_regular_buffer_padded_when_gtp_params_present(rank, world_size, port):
+    """Regular (non-GTP) param buffers in DDP must also use padded layout when GTP is active.
+
+    Bug: when gtp_params is non-empty, full_param_layout.layouts contains stale GTP entries
+    that don't belong to the regular buffer, causing KeyErrors in DistOpt's param map.
+    DDP avoided this by forcing param_layout=None for regular buffers, but that falls through
+    to _compute_default_per_buffer_param_layout, which produces unpadded bucket ends, again
+    violating param_and_grad_buffer.py:1427 (end_index % data_parallel_world_size == 0).
+
+    Trigger:
+      GTP=2, DP=4  →  intra_dp_cp_group.size()=4  (regular params reduce over the full DP group)
+      bias=True  →  each bias has 2 elements (not divisible by 4)
+      Two layers: total regular numel=4, 4%4==0 (total check passes); bucket_size=2 forces
+      bucket-0 to contain only the first bias, end=2, 2%4≠0  →  AssertionError
+    """
+    from megatron.core import parallel_state as ps
+    from megatron.core.distributed import DistributedDataParallel, DistributedDataParallelConfig
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+
+    orig_pad = gtp_module.GTP_CONFIG.pad_for_alignment
+    gtp_module.GTP_CONFIG.pad_for_alignment = 0
+    try:
+        gtp_group = ps.get_gtp_weight_remat_group()
+
+        class _TwoLayerModelWithBias(torch.nn.Module):
+            def __init__(self):
+                super().__init__()
+                # bias=True: weight → GTPShardedParam (gtp_buffer), bias → regular param
+                self.fc0 = te.Linear(3, 2, bias=True, device="cuda")
+                self.fc1 = te.Linear(3, 2, bias=True, device="cuda")
+
+        model = _TwoLayerModelWithBias()
+        wrap_module_params_gtp(model.fc0, ["weight"], gtp_group)
+        wrap_module_params_gtp(model.fc1, ["weight"], gtp_group)
+
+        config = TransformerConfig(
+            num_attention_heads=1, num_layers=1, hidden_size=4, tensor_model_parallel_size=1
+        )
+        # bucket_size=2: each 2-element bias fills one bucket in the regular buffer.
+        # Without the fix: regular buffer uses param_layout=None → bucket-0 ends at 2,
+        # 2 % intra_dp_cp_group.size()(=4) != 0 → AssertionError at line 1427.
+        ddp_config = DistributedDataParallelConfig(
+            use_distributed_optimizer=True, overlap_grad_reduce=True, bucket_size=2
+        )
+
+        DistributedDataParallel(config, ddp_config, model)
+    finally:
+        gtp_module.GTP_CONFIG.pad_for_alignment = orig_pad
+        ps.destroy_model_parallel()
+        ps.initialize_model_parallel()
+
+
+class TestGTPDDPBucketAlignment:
+    def test_gtp_buffers_use_padded_layout_with_distributed_optimizer(self):
+        """GTP buffer bucket ends must be padded to intra_dp_cp_no_gtp_group.size()."""
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_gtp_ddp_bucket_alignment, 4)
+
+    def test_regular_buffers_use_padded_layout_when_gtp_params_present(self):
+        """Regular buf bucket ends must be padded even when gtp_params forces layoutrecompute."""
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_regular_buffer_padded_when_gtp_params_present, 4)
+
+
+# ---------------------------------------------------------------------------
+# 24. GTP DDP grad-ready wiring: register_grad_ready must fire AFTER the wgrad add
+# ---------------------------------------------------------------------------
+
+
+def _worker_gtp_ddp_grad_ready_wiring(rank, world_size, port):
+    """GTP params must drive DDP grad-ready from GTP's manual hook, not autograd.
+
+    GTP defers the main_grad accumulation to a later backward node, so autograd's AccumulateGrad can
+    fire register_grad_ready before the grad lands and dispatch the bucket reduce-scatter on stale
+    grad_data (corrupts reduce_scatter_with_fp32_accumulation). The fix routes grad-ready through
+    register_grad_accum_hook (fired after the add) and skips the autograd hook. This pins that
+    wiring: every GTP weight has _grad_accum_hook set and none falls through to the autograd list.
+    """
+    from megatron.core import parallel_state as ps
+    from megatron.core.distributed import DistributedDataParallel, DistributedDataParallelConfig
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    # The module fixture initialized model_parallel without GTP; re-init with GTP=2.
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+    try:
+        gtp_group = ps.get_gtp_weight_remat_group()
+
+        class _TwoLayerModel(torch.nn.Module):
+            def __init__(self):
+                super().__init__()
+                # bias=False -> all params are GTP weights, so grad_accs must end up empty.
+                self.fc0 = te.Linear(64, 128, bias=False, device="cuda")
+                self.fc1 = te.Linear(64, 128, bias=False, device="cuda")
+
+        model = _TwoLayerModel()
+        wrap_module_params_gtp(model.fc0, ["weight"], gtp_group)
+        wrap_module_params_gtp(model.fc1, ["weight"], gtp_group)
+
+        config = TransformerConfig(
+            num_attention_heads=1, num_layers=1, hidden_size=4, tensor_model_parallel_size=1
+        )
+        ddp_config = DistributedDataParallelConfig(
+            use_distributed_optimizer=True, overlap_grad_reduce=True
+        )
+        ddp_model = DistributedDataParallel(config, ddp_config, model)
+
+        for name, w in [("fc0", model.fc0.weight), ("fc1", model.fc1.weight)]:
+            assert isinstance(w, GTPShardedParam), f"{name}.weight should be a GTP param"
+            # Manual hook set -> grad-ready fires after the add; None -> early autograd path (bug).
+            assert (
+                getattr(w, "_grad_accum_hook", None) is not None
+            ), f"{name}.weight must have _grad_accum_hook set (manual grad-ready, not autograd)"
+
+        # bias=False -> all params are GTP -> none took the autograd path.
+        assert len(ddp_model.grad_accs) == 0, (
+            "GTP params must not register an autograd AccumulateGrad hook "
+            f"(grad_accs has {len(ddp_model.grad_accs)} entries)"
+        )
+    finally:
+        ps.destroy_model_parallel()
+        ps.initialize_model_parallel()  # restore default for remaining tests
+
+
+class TestGTPDDPGradReadyWiring:
+    def test_gtp_params_use_manual_grad_ready_hook(self):
+        """GTP params route DDP grad-ready through register_grad_accum_hook, not autograd."""
+        _requires_multi_gpu(4)
+        _run_distributed(_worker_gtp_ddp_grad_ready_wiring, 4)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_gtp_cudagraph_grad.py b/tests/unit_tests/generalized_tensor_parallel/test_gtp_cudagraph_grad.py
new file mode 100644
index 00000000000..5ef0701034c
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_gtp_cudagraph_grad.py
@@ -0,0 +1,95 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Regression test for the GTP + CUDA-graph capture-step grad-norm bug.
+
+Bug: create_cudagraphs() runs after finalize_model_grads, so main_grad already holds the finalized
+(reduced + per-token-scaled) grads. create_fwd_graph then runs an eager warmup backward (graph
+capture only records ops, it doesn't run them), and that eager backward executes GTP's wgrad
+main_grad.add_ -- including the cascade add into a param's cross-graph ``next_w`` (in another
+module, via a stale RS ticket) -- clobbering the finalized grads and spiking the step's grad norm.
+
+Fix: create_fwd_graph snapshots the grads its warmup touches via ``_backup_grads_before_capture``
+and restores them after. This test exercises that helper pair directly: the module's own params
+and their cross-graph ``next_w`` must survive a simulated warmup clobber.
+"""
+
+import pytest
+import torch
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+from megatron.core.transformer.cuda_graphs import (
+    _backup_grads_before_capture,
+    _restore_grads_after_capture,
+)
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TE with hook registry", allow_module_level=True)
+
+
+def _gtp_param(value: float, numel: int = 8) -> torch.nn.Parameter:
+    """A param with a finalized (reduced + scaled) main_grad, flagged as a GTP weight."""
+    p = torch.nn.Parameter(torch.zeros(numel, device="cuda"))
+    p.is_gtp = True
+    p.main_grad = torch.full((numel,), value, device="cuda")
+    return p
+
+
+class _Mod(torch.nn.Module):
+    def __init__(self, weight: torch.nn.Parameter):
+        super().__init__()
+        self.weight = weight
+
+
+class _StubRunner:
+    """The ``base_module`` and ``gtp_remat`` attrs that ``_backup_grads_before_capture`` reads."""
+
+    def __init__(self, base_module: torch.nn.Module, gtp_remat: bool = True):
+        self.base_module = base_module
+        self.gtp_remat = gtp_remat
+
+
+class TestGTPCaptureGradSnapshot:
+    def test_preserves_own_and_cross_graph_next_w(self):
+        """Snapshot/restore must keep both the module's own grad and its cross-graph next_w grad
+        (in another module) intact across a capture that clobbers them."""
+        own = _gtp_param(0.0125)
+        cross = _gtp_param(0.02)  # next_w lives in a different module/graph
+        own.next_w = cross
+        runner = _StubRunner(_Mod(own))
+
+        backup = _backup_grads_before_capture(runner)
+        own.main_grad.add_(410.0)  # simulate the capture-time main_grad.add_ clobber
+        cross.main_grad.add_(99.0)
+        _restore_grads_after_capture(backup)
+
+        torch.testing.assert_close(own.main_grad, torch.full((8,), 0.0125, device="cuda"))
+        torch.testing.assert_close(cross.main_grad, torch.full((8,), 0.02, device="cuda"))
+
+    def test_routed_expert_next_w_via_weight_list(self):
+        """A routed-expert next_w exposes its shards via ``weight_list`` (read directly, since the
+        ``_weights`` property raises on non-leaders before capture)."""
+        own = _gtp_param(0.0125)
+        shard0, shard1 = _gtp_param(0.03), _gtp_param(0.04)
+        routed = torch.nn.Parameter(torch.zeros(8, device="cuda"))  # leader wrapper (no own grad)
+        routed.is_routed_expert = True
+        routed.weight_list = [shard0, shard1]
+        own.next_w = routed
+        runner = _StubRunner(_Mod(own))
+
+        backup = _backup_grads_before_capture(runner)
+        shard0.main_grad.add_(50.0)
+        shard1.main_grad.add_(60.0)
+        _restore_grads_after_capture(backup)
+
+        torch.testing.assert_close(shard0.main_grad, torch.full((8,), 0.03, device="cuda"))
+        torch.testing.assert_close(shard1.main_grad, torch.full((8,), 0.04, device="cuda"))
+
+    def test_non_gtp_backs_up_own_params_only(self):
+        """Non-GTP runner: own params are snapshotted, but the GTP cross-graph next_w walk is
+        skipped (the bwd capture doesn't touch main_grad on the non-GTP path)."""
+        own = _gtp_param(0.0125)
+        cross = _gtp_param(0.02)
+        own.next_w = cross
+        backup = _backup_grads_before_capture(_StubRunner(_Mod(own), gtp_remat=False))
+        assert id(own) in backup
+        assert id(cross) not in backup
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_gtp_dcp.py b/tests/unit_tests/generalized_tensor_parallel/test_gtp_dcp.py
new file mode 100644
index 00000000000..9c1b054fe54
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_gtp_dcp.py
@@ -0,0 +1,792 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Unit tests for GTP + distributed checkpointing.
+
+Verifies that ``make_sharded_tensors_for_checkpoint_with_gtp`` emits
+ShardedTensor offsets that correctly encode TP × GTP sharding, and that
+the helper is a no-op (delegates to vanilla) when no ``GTPShardedParam``
+is present in the input state_dict.
+
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.dist_checkpointing import ShardedTensor
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TE with hook registry", allow_module_level=True)
+
+from megatron.core.tensor_parallel.gtp import (  # noqa: E402
+    GTP_CONFIG,
+    GTPShardedParam,
+    make_sharded_tensors_for_checkpoint_with_gtp,
+    reset_gtp_quantize_cache,
+    update_gtp_config,
+    wrap_module_params_gtp,
+)
+from tests.unit_tests.test_utilities import Utils  # noqa: E402
+
+
+@pytest.fixture(autouse=True)
+def _no_pad_alignment():
+    """Disable GTP padding for the duration of each test so local shard sizes
+    are exactly ``per_tp_out / gtp_size`` and the test math stays simple.
+    DCP semantics with padding are exercised by the integration tests.
+    """
+    orig = GTP_CONFIG.pad_for_alignment
+    update_gtp_config(pad_for_alignment=0)
+    yield
+    update_gtp_config(pad_for_alignment=orig)
+
+
+@pytest.fixture(scope="module", autouse=True)
+def _torchrun_dist_init():
+    Utils.initialize_model_parallel()
+    yield
+    Utils.destroy_model_parallel()
+
+
+def _require_world_size(n):
+    if dist.get_world_size() != n:
+        pytest.skip(
+            f"Requires world_size={n}, got {dist.get_world_size()} "
+            f"(launch with torchrun --nproc-per-node={n})"
+        )
+
+
+def _make_gtp_shard(out_features, in_features, gtp_group, dtype=torch.bfloat16):
+    """Build a small GTPShardedParam by wrapping a one-param dummy module."""
+
+    class _Dummy(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.weight = torch.nn.Parameter(
+                torch.arange(out_features * in_features, dtype=dtype, device="cuda").reshape(
+                    out_features, in_features
+                )
+            )
+
+    mod = _Dummy()
+    wrap_module_params_gtp(mod, ["weight"], gtp_group)
+    return mod.weight  # now a GTPShardedParam
+
+
+def _worker_helper_offsets_tp_eq_gtp_axis(rank, world_size, port):
+    """TP=2, GTP=2 (4 ranks total). Weight is GTPShardedParam.
+
+    Production flow: Mcore TE constructs the Linear with already-TP-sliced
+    out_features (i.e. full / tp_size). GTP then slices that further by
+    gtp_size. We mimic that by starting with a per-TP-rank tensor of size
+    ``full // tp_size`` and letting wrap_module_params_gtp slice it.
+    """
+    gtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+    tp_group = dist.new_group([0, 2]) if rank in (0, 2) else dist.new_group([1, 3])
+
+    full_out_features = 8
+    tp_size, gtp_size = 2, 2
+    per_tp_out = full_out_features // tp_size  # 4
+    per_shard_out = per_tp_out // gtp_size  # 2
+    in_features = 4
+
+    weight = _make_gtp_shard(per_tp_out, in_features, gtp_group)
+    assert weight.shape == (per_shard_out, in_features), (
+        f"rank={rank} local shard shape {tuple(weight.shape)} != "
+        f"({per_shard_out}, {in_features})"
+    )
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": weight},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 0},
+        sharded_offsets=(),
+        tp_group=tp_group,
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["weight"]
+    assert isinstance(st, ShardedTensor), f"Expected ShardedTensor, got {type(st)}"
+
+    # Composite offset: (axis=0, tp_rank*gtp_size+gtp_rank, tp_size*gtp_size)
+    # rank → (tp_rank, gtp_rank): 0→(0,0), 1→(0,1), 2→(1,0), 3→(1,1)
+    tp_rank = rank // 2
+    gtp_rank = rank % 2
+    expected_offset = (tp_rank * gtp_size + gtp_rank) * per_shard_out
+    assert (
+        st.global_offset[0] == expected_offset
+    ), f"rank={rank} expected axis-0 offset {expected_offset}, got {st.global_offset[0]}"
+    assert (
+        st.global_shape[0] == full_out_features
+    ), f"rank={rank} expected global axis-0 size {full_out_features}, got {st.global_shape[0]}"
+
+
+def _worker_helper_offsets_tp_neq_gtp_axis(rank, world_size, port):
+    """Row-parallel: TP=2 shards axis 1, GTP=2 shards axis 0.
+
+    Per-TP-rank tensor: (full_out, full_in/tp_size). GTP further shards
+    axis 0 to (full_out/gtp_size, full_in/tp_size).
+    """
+    gtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+    tp_group = dist.new_group([0, 2]) if rank in (0, 2) else dist.new_group([1, 3])
+
+    full_out, full_in = 8, 4
+    tp_size, gtp_size = 2, 2
+    per_tp_in = full_in // tp_size  # 2
+    per_shard_out = full_out // gtp_size  # 4
+
+    weight = _make_gtp_shard(full_out, per_tp_in, gtp_group)
+    assert weight.shape == (per_shard_out, per_tp_in)
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": weight},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 1},  # row-parallel
+        sharded_offsets=(),
+        tp_group=tp_group,
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["weight"]
+    tp_rank = rank // 2
+    gtp_rank = rank % 2
+    assert (
+        st.global_offset[0] == gtp_rank * per_shard_out
+    ), f"rank={rank} axis-0 offset wrong: {st.global_offset[0]}"
+    assert (
+        st.global_offset[1] == tp_rank * per_tp_in
+    ), f"rank={rank} axis-1 offset wrong: {st.global_offset[1]}"
+    assert st.global_shape == (
+        full_out,
+        full_in,
+    ), f"rank={rank} global shape {st.global_shape} != ({full_out}, {full_in})"
+
+
+def _worker_helper_no_op_no_gtp(rank, world_size, port):
+    """Helper must delegate to vanilla when state_dict has no GTPShardedParam.
+
+    Per-TP-rank shape under column-parallel TP=2: (full_out//tp_size, in).
+    """
+    tp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+
+    full_out, in_features, tp_size = 8, 4, 2
+    per_tp_out = full_out // tp_size
+
+    plain = torch.nn.Parameter(
+        torch.zeros(per_tp_out, in_features, dtype=torch.bfloat16, device="cuda")
+    )
+    bias = torch.nn.Parameter(torch.zeros(per_tp_out, dtype=torch.bfloat16, device="cuda"))
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": plain, "bias": bias},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 0, "bias": 0},
+        sharded_offsets=(),
+        tp_group=tp_group,
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    # tp_group is [0,1] for ranks 0,1 and [2,3] for ranks 2,3 here — local tp_rank = rank % 2
+    tp_rank = rank % 2
+    assert sharded["weight"].global_offset[0] == tp_rank * per_tp_out, (
+        f"rank={rank} fallback path produced wrong offset for weight: "
+        f"{sharded['weight'].global_offset[0]}"
+    )
+    assert sharded["weight"].global_shape == (full_out, in_features)
+
+
+def _worker_helper_padded_inproj_no_pad_case(rank, world_size, port):
+    """``in_proj.weight`` shape modeled after the production case (z|x|B|C|dt
+    concat along dim 0). With GTP=4 and these dim-0 sizes the alignment
+    constraint ``dim0 % (gtp_size * pad_for_alignment) == 0`` is satisfied —
+    *no* padding fires. Verify the helper emits the expected offsets.
+    """
+    update_gtp_config(pad_for_alignment=16)
+    # dim0 = 512+512+64+64+8 = 1160 → 1160 % (4*16=64) = 8 ⇒ NOT aligned.
+    # Pick sizes that ARE aligned to 64 to exercise the no-pad path:
+    dim0 = 1152  # = 18 * 64; alignment-clean for gtp_size=4, pad=16
+    in_features = 4
+
+    # All 4 ranks form a single GTP group.
+    gtp_group = dist.new_group(list(range(world_size)))
+    weight = _make_gtp_shard(dim0, in_features, gtp_group)
+
+    # No padding ⇒ local shape is exactly dim0 / 4 = 288
+    expected_local = dim0 // 4
+    assert weight.shape == (expected_local, in_features), (
+        f"rank={rank}: padding should NOT have fired (dim0 aligned); "
+        f"got local shape {tuple(weight.shape)}, expected ({expected_local}, {in_features})"
+    )
+    assert getattr(weight, "pad_length", 0) == 0
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": weight},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 0},
+        sharded_offsets=(),
+        tp_group=dist.new_group([rank]),  # trivial 1-rank TP group
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["weight"]
+    assert (
+        st.global_shape[0] == dim0
+    ), f"rank={rank} no-pad case: global_shape[0] {st.global_shape[0]} != {dim0}"
+    assert st.global_offset[0] == rank * expected_local
+
+
+def _worker_helper_padded_inproj_pad_case(rank, world_size, port):
+    """Same in_proj layout but with a dim-0 size that requires GTP padding.
+
+    z=512, x=512, B=64, C=64, dt=8 → dim0=1160. With gtp_size=4 and
+    pad_for_alignment=16, alignment block = 64; 1160 % 64 = 8 so 56 pad
+    rows are appended. Padded dim0 = 1216, per-rank shard = 304 (uniform
+    across all 4 ranks; the pad rows live at the tail of rank-3's slice).
+
+    The helper today saves the *padded* global shape (1216) — round-trip is
+    correct under save_gtp_size == load_gtp_size. This test pins that
+    behaviour and serves as a regression for the future "unpadded global"
+    fix.
+    """
+    update_gtp_config(pad_for_alignment=16)
+    dim0_unpadded = 1160  # z(512) + x(512) + B(64) + C(64) + dt(8)
+    in_features = 4
+    gtp_size = world_size
+    alignment_block = 16 * gtp_size  # = 64
+    pad = (alignment_block - dim0_unpadded % alignment_block) % alignment_block
+    dim0_padded = dim0_unpadded + pad
+    per_shard = dim0_padded // gtp_size
+
+    gtp_group = dist.new_group(list(range(world_size)))
+    weight = _make_gtp_shard(dim0_unpadded, in_features, gtp_group)
+
+    assert weight.shape == (
+        per_shard,
+        in_features,
+    ), f"rank={rank}: post-pad shard shape {tuple(weight.shape)} != ({per_shard}, {in_features})"
+    # Only rank-3 (the last GTP rank) carries the trailing pad rows; all ranks
+    # report the same pad_length (an invariant set by _gtp_slice_one_param).
+    assert (
+        getattr(weight, "pad_length", 0) == pad
+    ), f"rank={rank}: pad_length {getattr(weight, 'pad_length', 0)} != {pad}"
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": weight},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 0},
+        sharded_offsets=(),
+        tp_group=dist.new_group([rank]),
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["weight"]
+    # Helper saves the padded global. ``allow_shape_mismatch=True`` is what
+    # makes the saved tensor portable to a different load-time GTP topology
+    # (different alignment choice yields a different padded size).
+    assert (
+        st.global_shape[0] == dim0_padded
+    ), f"rank={rank} pad case: global_shape[0] {st.global_shape[0]} != {dim0_padded}"
+    assert st.global_offset[0] == rank * per_shard
+    assert st.allow_shape_mismatch is True, (
+        f"rank={rank} pad case: allow_shape_mismatch must be True when GTP padding fires; "
+        f"otherwise the ckpt cannot be loaded at a different GTP topology."
+    )
+
+
+def _worker_helper_cross_topology_reshard_metadata(rank, world_size, port):
+    """Pin the cross-topology reshard contract via ShardedTensor metadata.
+
+    We can't run a real DCP save/load against itself within a single torchrun
+    (need separate worlds), but we can verify the saved ShardedTensor carries
+    everything DCP needs to do the reshard: ``allow_shape_mismatch=True`` and
+    a global_shape large enough to cover any compatible load-side topology
+    (≥ unpadded original).
+    """
+    update_gtp_config(pad_for_alignment=16)
+    dim0_unpadded = 1160
+    in_features = 4
+    gtp_size = world_size
+    alignment_block = 16 * gtp_size  # 64
+    dim0_padded = (
+        dim0_unpadded + (alignment_block - dim0_unpadded % alignment_block) % alignment_block
+    )
+    per_shard = dim0_padded // gtp_size
+
+    gtp_group = dist.new_group(list(range(world_size)))
+    weight = _make_gtp_shard(dim0_unpadded, in_features, gtp_group)
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": weight},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 0},
+        sharded_offsets=(),
+        tp_group=dist.new_group([rank]),
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["weight"]
+    # 1. The saved global covers >= unpadded original size.
+    assert st.global_shape[0] >= dim0_unpadded, (
+        f"rank={rank} saved global_shape ({st.global_shape[0]}) < unpadded ({dim0_unpadded}); "
+        f"would lose valid data on cross-topology reshard."
+    )
+    # 2. ``allow_shape_mismatch=True`` lets DCP tolerate that the load-side
+    #    padded size may differ.
+    assert st.allow_shape_mismatch is True
+    # 3. Each rank's offset+local_shape covers a contiguous slice of the
+    #    padded global; together the ranks cover [0, padded_global).
+    assert st.global_offset[0] + st.local_shape[0] <= st.global_shape[0]
+    assert st.global_offset[0] + st.local_shape[0] == (rank + 1) * per_shard
+
+
+def _worker_save_then_load_offsets_symmetric(rank, world_size, port):
+    """Save-side and load-side ShardedTensors must produce identical offsets
+    and global_shape so DCP can correctly resharded between them.
+
+    We don't run the real DCP save (avoids filesystem / async-writer issues
+    in CI); we just verify the symmetry property the load path relies on.
+    """
+    update_gtp_config(pad_for_alignment=0)
+    dim0 = 16
+    in_features = 4
+    gtp_group = dist.new_group(list(range(world_size)))
+
+    def _build(prefix):
+        weight = _make_gtp_shard(dim0, in_features, gtp_group)
+        return make_sharded_tensors_for_checkpoint_with_gtp(
+            {"weight": weight},
+            prefix=prefix,
+            tensor_parallel_layers_axis_map={"weight": 0},
+            sharded_offsets=(),
+            tp_group=dist.new_group([rank]),
+            dp_cp_group=dist.new_group(list(range(world_size))),
+        )["layer.weight"]
+
+    save_st = _build("layer.")
+    load_st = _build("layer.")
+    assert save_st.global_shape == load_st.global_shape
+    assert save_st.global_offset == load_st.global_offset
+    assert save_st.local_shape == load_st.local_shape
+    assert save_st.replica_id == load_st.replica_id
+
+
+def _worker_reset_quantize_cache(rank, world_size, port):
+    """`reset_gtp_quantize_cache` must flip did_cast_to_low_precision back to False."""
+    gtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+
+    class _Dummy(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.weight = torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.bfloat16, device="cuda"))
+
+    mod = _Dummy()
+    wrap_module_params_gtp(mod, ["weight"], gtp_group)
+    p = mod.weight
+    p.did_cast_to_low_precision = True
+
+    reset_gtp_quantize_cache(mod)
+    assert p.did_cast_to_low_precision is False
+
+
+def _worker_helper_offsets_ep_egtp(rank, world_size, port):
+    """EP=2, EGTP=2 (4 ranks): routed-expert weight.
+
+    Mirrors ``TEGroupedLinear.sharded_state_dict``: expert parallelism prepends a
+    global-expert axis through ``sharded_offsets``, and EGTP shards each expert's
+    ``out_features`` (axis 0). The GTP-aware checkpoint helper layers the EGTP
+    axis-0 split on top of the prepended expert offset.
+
+    rank → (ep_rank, egtp_rank): 0→(0,0) 1→(0,1) 2→(1,0) 3→(1,1).
+    """
+    egtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+
+    ep_size, egtp_size, num_gemms = 2, 2, 1
+    ep_rank = rank // 2
+    egtp_rank = rank % 2
+    per_expert_out = 4
+    per_shard_out = per_expert_out // egtp_size  # 2
+    in_features = 4
+    num_global_experts = ep_size * num_gemms  # 2
+    global_expert_idx = ep_rank * num_gemms  # + gemm_idx (0)
+
+    weight = _make_gtp_shard(per_expert_out, in_features, egtp_group)
+    assert weight.shape == (
+        per_shard_out,
+        in_features,
+    ), f"rank={rank} EGTP shard shape {tuple(weight.shape)} != ({per_shard_out}, {in_features})"
+
+    sharded = make_sharded_tensors_for_checkpoint_with_gtp(
+        {"weight": weight},
+        prefix="",
+        tensor_parallel_layers_axis_map={"weight": 0},
+        # EP prepends the global-expert axis; EGTP shards out_features below it.
+        sharded_offsets=((0, global_expert_idx, num_global_experts),),
+        tp_group=dist.new_group([rank]),  # no TP in this case
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["weight"]
+    assert isinstance(st, ShardedTensor), f"Expected ShardedTensor, got {type(st)}"
+    # global shape = (num_global_experts, full_out_features, in_features)
+    assert st.global_shape == (num_global_experts, per_expert_out, in_features), (
+        f"rank={rank} global_shape {st.global_shape} != "
+        f"({num_global_experts}, {per_expert_out}, {in_features})"
+    )
+    # Prepended expert axis (axis 0): offset == this rank's global expert index.
+    assert (
+        st.global_offset[0] == global_expert_idx
+    ), f"rank={rank} expert-axis offset {st.global_offset[0]} != {global_expert_idx}"
+    # EGTP axis (weight axis 0, shifted to global axis 1): offset == egtp_rank · per_shard.
+    assert (
+        st.global_offset[1] == egtp_rank * per_shard_out
+    ), f"rank={rank} EGTP axis-1 offset {st.global_offset[1]} != {egtp_rank * per_shard_out}"
+
+
+def _worker_helper_embedding_offsets(rank, world_size, port):
+    """Embedding / output_layer path: ``VocabParallelEmbedding.sharded_state_dict`` calls
+    ``make_tp_sharded_tensor_for_checkpoint`` DIRECTLY (it needs allow_shape_mismatch for
+    vocab padding), bypassing the GTP-aware wrapper. So that helper itself must layer the
+    GTP axis-0 split. TP=2, GTP=2, tp_axis=0 (vocab) → composite axis-0 offset, same as the
+    column-parallel case.
+    """
+    from megatron.core.utils import make_tp_sharded_tensor_for_checkpoint
+
+    gtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+    tp_group = dist.new_group([0, 2]) if rank in (0, 2) else dist.new_group([1, 3])
+
+    full_vocab, hidden = 8, 4
+    tp_size, gtp_size = 2, 2
+    per_tp = full_vocab // tp_size  # 4
+    per_shard = per_tp // gtp_size  # 2
+
+    weight = _make_gtp_shard(per_tp, hidden, gtp_group)
+    assert weight.shape == (per_shard, hidden)
+
+    st = make_tp_sharded_tensor_for_checkpoint(
+        tensor=weight,
+        key="embedding.word_embeddings.weight",
+        tp_axis=0,
+        allow_shape_mismatch=True,  # how VocabParallelEmbedding calls it
+        prepend_offsets=(),
+        tp_group=tp_group,
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    assert isinstance(st, ShardedTensor), f"Expected ShardedTensor, got {type(st)}"
+    tp_rank = rank // 2
+    gtp_rank = rank % 2
+    expected_offset = (tp_rank * gtp_size + gtp_rank) * per_shard
+    assert (
+        st.global_offset[0] == expected_offset
+    ), f"rank={rank} embedding axis-0 offset {st.global_offset[0]} != {expected_offset}"
+    assert (
+        st.global_shape[0] == full_vocab
+    ), f"rank={rank} embedding global axis-0 {st.global_shape[0]} != {full_vocab}"
+
+
+def _worker_helper_public_wrapper_delegates(rank, world_size, port):
+    """The public ``make_sharded_tensors_for_checkpoint`` (the entry point most layers call,
+    e.g. ColumnParallelLinear / output_layer) must detect a GTPShardedParam and produce the
+    GTP-composite offset — i.e. it delegates to the GTP-aware path rather than the vanilla
+    TP-only one. TP=2, GTP=2, column-parallel (tp_axis=0).
+    """
+    from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint
+
+    gtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+    tp_group = dist.new_group([0, 2]) if rank in (0, 2) else dist.new_group([1, 3])
+
+    full_out, in_features = 8, 4
+    tp_size, gtp_size = 2, 2
+    per_tp_out = full_out // tp_size  # 4
+    per_shard_out = per_tp_out // gtp_size  # 2
+
+    weight = _make_gtp_shard(per_tp_out, in_features, gtp_group)
+
+    sharded = make_sharded_tensors_for_checkpoint(
+        {"weight": weight},
+        prefix="layer.",
+        tensor_parallel_layers_axis_map={"weight": 0},
+        sharded_offsets=(),
+        tp_group=tp_group,
+        dp_cp_group=dist.new_group(list(range(world_size))),
+    )
+    st = sharded["layer.weight"]
+    assert isinstance(st, ShardedTensor), f"Expected ShardedTensor, got {type(st)}"
+    tp_rank = rank // 2
+    gtp_rank = rank % 2
+    expected_offset = (tp_rank * gtp_size + gtp_rank) * per_shard_out
+    assert st.global_offset[0] == expected_offset, (
+        f"rank={rank} public wrapper did not produce the GTP-composite offset: "
+        f"{st.global_offset[0]} != {expected_offset} (delegation to the GTP path failed?)"
+    )
+    assert (
+        st.global_shape[0] == full_out
+    ), f"rank={rank} global axis-0 {st.global_shape[0]} != {full_out}"
+
+
+def _worker_helper_replicated_sink_rejects_gtp(rank, world_size, port):
+    """Sanity guard: a GTPShardedParam must NEVER be saved via the replicated
+    make_sharded_tensor_for_checkpoint (it would record a shard-sized global shape).
+    The helper asserts; this pins that behaviour.
+    """
+    from megatron.core.utils import make_sharded_tensor_for_checkpoint
+
+    gtp_group = dist.new_group([0, 1]) if rank in (0, 1) else dist.new_group([2, 3])
+    weight = _make_gtp_shard(4, 4, gtp_group)
+    with pytest.raises(AssertionError):
+        make_sharded_tensor_for_checkpoint(
+            weight,
+            "weight",
+            tp_group=dist.new_group([rank]),
+            dp_cp_group=dist.new_group(list(range(world_size))),
+        )
+
+
+def _worker_mamba_replicated_param_replica_ids(rank, world_size, port):
+    """End-to-end ``MambaMixer.sharded_state_dict`` under GTP: the GTP-REPLICATED
+    directly-owned params (A_log / dt_bias / D / conv1d.*) must get conflict-free
+    replica_ids — distinct across every rank holding the same chunk, with exactly
+    one "main" (writer) replica — so DCP elects a single writer per chunk.
+
+    With TP=1 these params are full on every rank, so all ``world_size`` replicas
+    of each must have unique replica_ids and exactly one writer. This is the
+    invariant the gtp_rank replica_id fixup defends; it must hold whether or not
+    that fixup runs (the gtp-inclusive dp_cp rank already disambiguates peers).
+    """
+    from megatron.core import parallel_state as ps
+    from megatron.core.dist_checkpointing.mapping import (
+        ShardedObject,
+        ShardedTensorFactory,
+        is_main_replica,
+    )
+    from megatron.core.extensions.transformer_engine import (
+        TELayerNormColumnParallelLinear,
+        TERowParallelLinear,
+    )
+    from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.ssm.mamba_layer import MambaLayer, MambaLayerSubmodules
+    from megatron.core.ssm.mamba_mixer import MambaMixer, MambaMixerSubmodules
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+    from megatron.core.transformer.spec_utils import ModuleSpec
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    GTP = 2  # world=4 -> tp1 * gtp2 * dp2 (exercises both gtp peers and replicate DP)
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=GTP
+    )
+    model_parallel_cuda_manual_seed(42)
+    pg = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+
+    config = TransformerConfig(
+        num_attention_heads=32,
+        num_layers=1,
+        hidden_size=4096,
+        mamba_num_heads=128,
+        mamba_head_dim=64,
+        mamba_state_dim=128,
+        mamba_num_groups=8,
+        use_mamba_mem_eff_path=True,
+        params_dtype=torch.bfloat16,
+        hidden_dropout=0.0,
+        bias_dropout_fusion=False,
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+    )
+    submodules = MambaLayerSubmodules(
+        mixer=ModuleSpec(
+            module=MambaMixer,
+            submodules=MambaMixerSubmodules(
+                in_proj=TELayerNormColumnParallelLinear, out_proj=TERowParallelLinear
+            ),
+        ),
+        mamba_bda=get_bias_dropout_add,
+    )
+    layer = MambaLayer(config, submodules, layer_number=1, pg_collection=pg).cuda()
+    assert any(
+        isinstance(p, GTPShardedParam) for p in layer.parameters()
+    ), "GTP not active: no GTPShardedParam in the GTP=2 Mamba layer"
+
+    metadata = {'dp_cp_group': ps.get_data_parallel_group(with_context_parallel=True)}
+    sd = layer.mixer.sharded_state_dict(prefix='mixer.', metadata=metadata)
+
+    target_bases = {'A_log', 'dt_bias', 'D', 'conv1d.weight', 'conv1d.bias'}
+    local = {}
+    for key, val in sd.items():
+        base = key.split('mixer.', 1)[-1]
+        if base in target_bases and isinstance(
+            val, (ShardedTensor, ShardedTensorFactory, ShardedObject)
+        ):
+            rid = val.replica_id
+            if isinstance(rid, tuple):
+                local[base] = tuple(rid)
+
+    gathered = [None] * world_size
+    dist.all_gather_object(gathered, local)
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    if rank == 0:
+        bases = set(gathered[0])
+        assert bases, "no GTP-replicated tiny params found in MambaMixer sharded_state_dict"
+        for base in sorted(bases):
+            rids = [g[base] for g in gathered]
+            assert (
+                len(set(rids)) == world_size
+            ), f"{base}: replica_id collision across ranks -> DCP write conflict: {rids}"
+            n_writers = sum(is_main_replica(r) for r in rids)
+            assert n_writers == 1, f"{base}: expected exactly 1 writer, got {n_writers}: {rids}"
+
+
+def _worker_mamba_inproj_optim_param_map(rank, world_size, port):
+    """GTP+Muon checkpoint fix: in_proj's gathered+split model entry does NOT object-id-match the
+    per-shard optimizer param, so get_param_id_to_sharded_param_map misses it (the KeyError seen in
+    Float16OptimizerWithFloat16Params.sharded_state_dict). Verify the per-shard fallback used by the
+    fix restores a ShardedTensor with local_shape == the optimizer param shape, which
+    make_sharded_optimizer_tensor then accepts.
+    """
+    from megatron.core import parallel_state as ps
+    from megatron.core.dist_checkpointing.optimizer import (
+        get_param_id_to_sharded_param_map,
+        make_sharded_optimizer_tensor,
+    )
+    from megatron.core.extensions.transformer_engine import (
+        TELayerNormColumnParallelLinear,
+        TERowParallelLinear,
+    )
+    from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.ssm.mamba_layer import MambaLayer, MambaLayerSubmodules
+    from megatron.core.ssm.mamba_mixer import MambaMixer, MambaMixerSubmodules
+    from megatron.core.tensor_parallel.gtp import (
+        make_sharded_tensors_for_checkpoint_with_gtp,
+        tag_gtp_params_with_names,
+    )
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+    from megatron.core.transformer.spec_utils import ModuleSpec
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+    model_parallel_cuda_manual_seed(42)
+    pg = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = TransformerConfig(
+        num_attention_heads=32,
+        num_layers=1,
+        hidden_size=4096,
+        mamba_num_heads=128,
+        mamba_head_dim=64,
+        mamba_state_dim=128,
+        mamba_num_groups=8,
+        use_mamba_mem_eff_path=True,
+        params_dtype=torch.bfloat16,
+        hidden_dropout=0.0,
+        bias_dropout_fusion=False,
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+    )
+    submodules = MambaLayerSubmodules(
+        mixer=ModuleSpec(
+            module=MambaMixer,
+            submodules=MambaMixerSubmodules(
+                in_proj=TELayerNormColumnParallelLinear, out_proj=TERowParallelLinear
+            ),
+        ),
+        mamba_bda=get_bias_dropout_add,
+    )
+    layer = MambaLayer(config, submodules, layer_number=1, pg_collection=pg).cuda()
+    tag_gtp_params_with_names(layer)  # set _debug_name (mirrors production setup)
+
+    in_proj_w = layer.mixer.in_proj.weight
+    assert isinstance(in_proj_w, GTPShardedParam), "in_proj.weight should be GTP-sharded"
+
+    metadata = {'dp_cp_group': ps.get_data_parallel_group(with_context_parallel=True)}
+    model_sd = layer.mixer.sharded_state_dict(prefix='mixer.', metadata=metadata)
+
+    # Reproduce the gap: in_proj's per-shard optim param has no id-match in the model dict.
+    id_map = get_param_id_to_sharded_param_map(model_sd, [in_proj_w])
+    assert 0 not in id_map, "expected in_proj to be MISSING from id map (the KeyError gap)"
+
+    # The fix's per-shard fallback restores a matching entry.
+    key = in_proj_w._debug_name or '_gtp_optim_param_0'
+    entry = make_sharded_tensors_for_checkpoint_with_gtp(
+        {key: in_proj_w},
+        prefix='',
+        tensor_parallel_layers_axis_map={key: 0},
+        tp_group=ps.get_tensor_model_parallel_group(),
+        dp_cp_group=ps.get_data_parallel_group(with_context_parallel=True),
+    )[key]
+    assert tuple(entry.local_shape) == tuple(in_proj_w.shape), (
+        f"per-shard entry local_shape {tuple(entry.local_shape)} != param shape "
+        f"{tuple(in_proj_w.shape)}"
+    )
+    # make_sharded_optimizer_tensor must accept it for a same-shape optimizer state tensor.
+    opt_state = torch.zeros_like(in_proj_w)
+    osh = make_sharded_optimizer_tensor(entry, opt_state, prefix='optimizer.state.exp_avg')
+    assert osh is not None
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+
+# ---------------------------------------------------------------------------
+# Test class wrappers (4-GPU)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.run_only_on_devices_with_compute_capability(compute_capability=(10, 0))
+class TestGtpDcpHelper:
+    def test_mamba_replicated_param_replica_ids(self):
+        _require_world_size(4)
+        _worker_mamba_replicated_param_replica_ids(dist.get_rank(), 4, None)
+
+    def test_mamba_inproj_optim_param_map(self):
+        _require_world_size(4)
+        _worker_mamba_inproj_optim_param_map(dist.get_rank(), 4, None)
+
+    def test_composite_offset_same_axis(self):
+        _require_world_size(4)
+        _worker_helper_offsets_tp_eq_gtp_axis(dist.get_rank(), 4, None)
+
+    def test_dual_offsets_cross_axis(self):
+        _require_world_size(4)
+        _worker_helper_offsets_tp_neq_gtp_axis(dist.get_rank(), 4, None)
+
+    def test_ep_egtp_offsets(self):
+        _require_world_size(4)
+        _worker_helper_offsets_ep_egtp(dist.get_rank(), 4, None)
+
+    def test_embedding_offsets(self):
+        _require_world_size(4)
+        _worker_helper_embedding_offsets(dist.get_rank(), 4, None)
+
+    def test_public_wrapper_delegates(self):
+        _require_world_size(4)
+        _worker_helper_public_wrapper_delegates(dist.get_rank(), 4, None)
+
+    def test_replicated_sink_rejects_gtp(self):
+        _require_world_size(4)
+        _worker_helper_replicated_sink_rejects_gtp(dist.get_rank(), 4, None)
+
+    def test_no_op_no_gtp(self):
+        _require_world_size(4)
+        _worker_helper_no_op_no_gtp(dist.get_rank(), 4, None)
+
+    def test_reset_quantize_cache(self):
+        _require_world_size(4)
+        _worker_reset_quantize_cache(dist.get_rank(), 4, None)
+
+    def test_inproj_no_pad(self):
+        _require_world_size(4)
+        _worker_helper_padded_inproj_no_pad_case(dist.get_rank(), 4, None)
+
+    def test_inproj_with_pad(self):
+        _require_world_size(4)
+        _worker_helper_padded_inproj_pad_case(dist.get_rank(), 4, None)
+
+    def test_cross_topology_reshard_metadata(self):
+        _require_world_size(4)
+        _worker_helper_cross_topology_reshard_metadata(dist.get_rank(), 4, None)
+
+    def test_save_then_load_offsets_symmetric(self):
+        _require_world_size(4)
+        _worker_save_then_load_offsets_symmetric(dist.get_rank(), 4, None)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_gtp_grad_correctness.py b/tests/unit_tests/generalized_tensor_parallel/test_gtp_grad_correctness.py
new file mode 100644
index 00000000000..831cf1c13e1
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_gtp_grad_correctness.py
@@ -0,0 +1,541 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Numeric repro: GTP gradient correctness through the REAL
+DDP + distributed-optimizer + finalize path, with replicate (DP) > 1.
+
+The validated loss-trajectory test uses DP=1 (replicate=1) and manual
+SGD on main_grad, so it cannot catch a gradient-reduction error that only shows
+up when the dist-opt shards over a replicate group of size > 1 (the new-at-64-GPU
+condition: DP2 x GTP16). This test reproduces that condition at small scale
+(world=4 = GTP2 x DP2) and checks the gradient end-to-end against a trusted
+no-GTP DP=4 baseline.
+
+Decisive choices:
+  * SGD lr=1.0 (NOT Adam): the step is scale-SENSITIVE, so a gtp x gradient
+    under-scale shows up directly as a gtp x smaller weight delta. Adam would
+    normalize a uniform scale error away and mask the bug.
+  * Distinct input per rank (seed=rank): each data-parallel position sees a
+    different batch (the HSDP guarantee), so the correct reduced grad is the
+    MEAN over all 4 positions. Baseline (DP4) and GTP (GTP2xDP2) both
+    span the same 4 positions, so their reduced grads -- and thus post-step
+    weights and grad-norm -- must match.
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (  # noqa: F401  (autouse, module-scoped: initializes the dist PG); noqa: F401  (autouse)
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+HIDDEN = 256
+NUM_HEADS = 8
+FFN_HIDDEN = 512
+NUM_LAYERS = 1
+SEQ = 16
+BATCH = 1
+LR = 1.0  # scale-sensitive SGD step
+dtype = torch.bfloat16
+
+
+def _make_config():
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    return TransformerConfig(
+        num_attention_heads=NUM_HEADS,
+        num_layers=NUM_LAYERS,
+        hidden_size=HIDDEN,
+        ffn_hidden_size=FFN_HIDDEN,
+        add_bias_linear=False,
+        params_dtype=dtype,
+        hidden_dropout=0.0,
+        attention_dropout=0.0,
+        bias_dropout_fusion=False,
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+    )
+
+
+def _make_stack(config, pg_collection):
+    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
+
+    spec = get_gpt_layer_with_transformer_engine_spec()
+    return torch.nn.ModuleList(
+        [
+            spec.module(config, spec.submodules, layer_number=i + 1, pg_collection=pg_collection)
+            for i in range(NUM_LAYERS)
+        ]
+    )
+
+
+def _build_ddp(stack):
+    """Wrap the stack in a NON-distributed-optimizer DDP so main_grad holds the
+    full all-reduced gradient (no optimizer needed; no Adam scale-invariance to
+    mask a scaling error)."""
+    from megatron.core.distributed import DistributedDataParallel, DistributedDataParallelConfig
+
+    config = _make_config()
+    ddp_config = DistributedDataParallelConfig(
+        use_distributed_optimizer=False, overlap_grad_reduce=False
+    )
+    module = torch.nn.Sequential()
+    for i, layer in enumerate(stack):
+        module.add_module(str(i), layer)
+    return DistributedDataParallel(config, ddp_config, module)
+
+
+def _run_one_backward(ddp_model, rank):
+    ddp_model.zero_grad_buffer()
+    # Distinct input per rank => the correct reduced grad is the MEAN over ranks.
+    torch.manual_seed(1000 + rank)
+    x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+    out = x
+    for layer in ddp_model.module.children():
+        out, _ = layer(out, attention_mask=None)
+    loss = out.float().mean()
+    loss.backward()
+    # Sync ONCE: finish_grad_sync() triggers the (single) grad reduction for
+    # overlap_grad_reduce=False. Do NOT also call start_grad_sync() — that double-
+    # reduces, which is idempotent at full-DP size but halves at replicate size.
+    ddp_model.finish_grad_sync()
+    from megatron.core.distributed.finalize_model_grads import (
+        _allreduce_replicated_grads_over_gtp_group,
+    )
+
+    _allreduce_replicated_grads_over_gtp_group([ddp_model])
+    return float(loss.item())
+
+
+def _full_main_grads(stack):
+    """Reconstruct full (unsharded) reduced gradients keyed by param name.
+
+    GTPShardedParam.main_grad is the local gtp shard -> all-gather over the gtp
+    group. Non-GTP params are replicated -> take the local (already gtp-summed) copy.
+    """
+    from megatron.core import parallel_state as ps
+
+    out = {}
+    for layer in stack:
+        for name, p in layer.named_parameters():
+            g_attr = 'main_grad' if hasattr(p, 'main_grad') else 'grad'
+            mg = getattr(p, g_attr)
+            if isinstance(p, GTPShardedParam):
+                g = ps.get_gtp_weight_remat_group()
+                shards = [torch.empty_like(mg) for _ in range(g.size())]
+                dist.all_gather(shards, mg.contiguous(), group=g)
+                out[name] = torch.cat(shards, dim=0).float().cpu()
+            else:
+                out[name] = mg.detach().float().cpu()
+    return out
+
+
+def _worker(rank, world_size, port):
+    from megatron.core import parallel_state as ps
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+
+    # ---------- Phase A: baseline, GTP=1 DP=4 (trusted standard path) ----------
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=1
+    )
+    model_parallel_cuda_manual_seed(42)
+    pgc = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    base_stack = _make_stack(_make_config(), pgc)
+    for layer in base_stack:
+        layer.cuda()
+    for p in base_stack.parameters():
+        dist.broadcast(p.data, src=0)
+    saved = {n: p.data.clone() for n, p in base_stack.named_parameters()}
+
+    base_ddp = _build_ddp(base_stack)
+    _run_one_backward(base_ddp, rank)
+    base_grads = _full_main_grads(base_stack)
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # ---------- Phase B: GTP=2 DP=2 (replicate>1!) ----------
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+    model_parallel_cuda_manual_seed(42)
+    pgc = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    gtp_stack = _make_stack(_make_config(), pgc)
+    for layer in gtp_stack:
+        layer.cuda()
+
+    g = ps.get_gtp_weight_remat_group()
+    gtp_rank = g.rank()
+    assert g.size() == 2, f"expected gtp shard group size 2, got {g.size()}"
+
+    # Load the SAME init weights as baseline: GTP params get their gtp shard.
+    for name, p in gtp_stack.named_parameters():
+        full = saved[name]
+        if isinstance(p, GTPShardedParam):
+            ss = p.shape[0]
+            p.data.copy_(full[gtp_rank * ss : (gtp_rank + 1) * ss])
+        else:
+            p.data.copy_(full)
+
+    gtp_ddp = _build_ddp(gtp_stack)
+    _run_one_backward(gtp_ddp, rank)
+    gtp_grads = _full_main_grads(gtp_stack)
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # ---------- Compare reduced gradients on rank 0 ----------
+    if rank == 0:
+        max_err = 0.0
+        worst = None
+        for name in base_grads:
+            bg, gg = base_grads[name], gtp_grads[name]
+            assert bg.shape == gg.shape, f"{name}: {bg.shape} vs {gg.shape}"
+            err = (bg - gg).abs().max().item()
+            denom = bg.abs().max().item() + 1e-8
+            rel = err / denom
+            ratio = (gg.norm() / (bg.norm() + 1e-12)).item()
+            print(
+                f"[grad] {name:55s} rel_max_err={rel:.3e}  norm_ratio(orth/base)={ratio:.4f}",
+                flush=True,
+            )
+            if rel > max_err:
+                max_err, worst = rel, name
+        print(
+            f"[summary] max relative grad error GTP-vs-DP4-baseline = {max_err:.3e} "
+            f"(worst: {worst})",
+            flush=True,
+        )
+        assert max_err < 2e-2, (
+            f"GTP2xDP2 reduced gradient does not match the no-GTP DP4 baseline "
+            f"(max rel err {max_err:.3e} on {worst}) -> gtp-axis grad reduction/scaling error."
+        )
+
+
+# ---------------------------------------------------------------------------
+# Distributed-optimizer + grad-norm path (the production 64-GPU path)
+# ---------------------------------------------------------------------------
+
+
+def _build_ddp_distopt_and_optim(stack):
+    """Real distributed-optimizer setup (Adam), matching the 64-GPU production path."""
+    from megatron.core.distributed import DistributedDataParallel, DistributedDataParallelConfig
+    from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer
+
+    config = _make_config()
+    ddp_config = DistributedDataParallelConfig(
+        use_distributed_optimizer=True, overlap_grad_reduce=False
+    )
+    module = torch.nn.Sequential()
+    for i, layer in enumerate(stack):
+        module.add_module(str(i), layer)
+    ddp_model = DistributedDataParallel(config, ddp_config, module)
+    opt_config = OptimizerConfig(
+        optimizer='adam',
+        lr=0.01,
+        bf16=True,
+        use_distributed_optimizer=True,
+        use_precision_aware_optimizer=False,
+        main_params_dtype=torch.float32,
+        main_grads_dtype=torch.float32,
+        exp_avg_dtype=torch.float32,
+        exp_avg_sq_dtype=torch.float32,
+        clip_grad=1.0,  # reported grad-norm is computed pre-clip, so this is just for the step
+    )
+    optim = get_megatron_optimizer(opt_config, [ddp_model])
+    return ddp_model, optim
+
+
+def _run_step_distopt(ddp_model, optim, rank):
+    """Mirror production finalize order: finish_grad_sync -> gtp-finalize -> optim.step().
+    Returns the optimizer-reported grad-norm (computed pre-clip from the reduced grads)."""
+    optim.zero_grad()
+    ddp_model.zero_grad_buffer()
+    torch.manual_seed(1000 + rank)
+    x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+    out = x
+    for layer in ddp_model.module.children():
+        out, _ = layer(out, attention_mask=None)
+    loss = out.float().mean()
+    loss.backward()
+    # Production order (finalize_model_grads): reduce across DP first, THEN the gtp finalize.
+    ddp_model.finish_grad_sync()
+    from megatron.core.distributed.finalize_model_grads import (
+        _allreduce_replicated_grads_over_gtp_group,
+    )
+
+    _allreduce_replicated_grads_over_gtp_group([ddp_model])
+    _, grad_norm, _ = optim.step()
+    return float(grad_norm)
+
+
+def _worker_distopt(rank, world_size, port):
+    from megatron.core import parallel_state as ps
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+
+    # ---------- Phase A: baseline, GTP=1 DP=4, dist-opt + Adam ----------
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=1
+    )
+    model_parallel_cuda_manual_seed(42)
+    pgc = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    base_stack = _make_stack(_make_config(), pgc)
+    for layer in base_stack:
+        layer.cuda()
+    for p in base_stack.parameters():
+        dist.broadcast(p.data, src=0)
+    saved = {n: p.data.clone() for n, p in base_stack.named_parameters()}
+    base_ddp, base_optim = _build_ddp_distopt_and_optim(base_stack)
+    base_gn = _run_step_distopt(base_ddp, base_optim, rank)
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # ---------- Phase B: GTP=2 DP=2, dist-opt + Adam ----------
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+    model_parallel_cuda_manual_seed(42)
+    pgc = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    gtp_stack = _make_stack(_make_config(), pgc)
+    for layer in gtp_stack:
+        layer.cuda()
+    g = ps.get_gtp_weight_remat_group()
+    gtp_rank = g.rank()
+    for name, p in gtp_stack.named_parameters():
+        full = saved[name]
+        if isinstance(p, GTPShardedParam):
+            ss = p.shape[0]
+            p.data.copy_(full[gtp_rank * ss : (gtp_rank + 1) * ss])
+        else:
+            p.data.copy_(full)
+    gtp_ddp, gtp_optim = _build_ddp_distopt_and_optim(gtp_stack)
+    gtp_gn = _run_step_distopt(gtp_ddp, gtp_optim, rank)
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    if rank == 0:
+        ratio = gtp_gn / max(base_gn, 1e-12)
+        print(
+            f"\n[distopt grad-norm] baseline={base_gn:.6f}  GTP={gtp_gn:.6f}  "
+            f"ratio={ratio:.4f}",
+            flush=True,
+        )
+        # Same model, same data, gradients proven equal -> grad-norm must match.
+        torch.testing.assert_close(torch.tensor(gtp_gn), torch.tensor(base_gn), atol=0, rtol=3e-2)
+
+
+# ---------------------------------------------------------------------------
+# MoE + EGTP dist-opt grad-norm path (a55b has experts; EGTP shards expert weights)
+# ---------------------------------------------------------------------------
+
+NUM_EXPERTS = 4
+MOE_FFN = 256
+
+
+def _make_moe_config():
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    return TransformerConfig(
+        num_attention_heads=NUM_HEADS,
+        num_layers=NUM_LAYERS,
+        hidden_size=HIDDEN,
+        ffn_hidden_size=FFN_HIDDEN,
+        num_moe_experts=NUM_EXPERTS,
+        moe_router_topk=2,
+        moe_ffn_hidden_size=MOE_FFN,
+        moe_grouped_gemm=True,
+        moe_token_dispatcher_type="alltoall",
+        moe_aux_loss_coeff=0.0,
+        add_bias_linear=False,
+        params_dtype=dtype,
+        hidden_dropout=0.0,
+        attention_dropout=0.0,
+        bias_dropout_fusion=False,
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+    )
+
+
+def _make_moe_stack(config, pg_collection):
+    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
+
+    spec = get_gpt_layer_with_transformer_engine_spec(
+        num_experts=NUM_EXPERTS, moe_grouped_gemm=True
+    )
+    return torch.nn.ModuleList(
+        [
+            spec.module(config, spec.submodules, layer_number=i + 1, pg_collection=pg_collection)
+            for i in range(NUM_LAYERS)
+        ]
+    )
+
+
+def _is_expert_param(name, p):
+    return ('experts' in name) or (not getattr(p, 'allreduce', True))
+
+
+def _worker_moe_distopt(rank, world_size, port):
+    from megatron.core import parallel_state as ps
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+
+    pgs = ['tp', 'cp', 'gtp', 'ep']
+
+    # ---------- Phase A: baseline GTP1/EGTP1, EP2 (DP2 dense / expert_dp2) ----------
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+        expert_model_parallel_size=2,
+        gtp_remat_size=1,
+        expert_gtp_remat_size=1,
+    )
+    model_parallel_cuda_manual_seed(42)
+    pgc = ProcessGroupCollection.use_mpu_process_groups(required_pgs=pgs)
+    base_stack = _make_moe_stack(_make_moe_config(), pgc)
+    for layer in base_stack:
+        layer.cuda()
+    # Broadcast only NON-expert (dense) params; expert weights are EP-local and must
+    # stay rank-distinct. Save all params per-rank for the GTP phase to mirror.
+    for name, p in base_stack.named_parameters():
+        if not _is_expert_param(name, p):
+            dist.broadcast(p.data, src=0)
+    saved = {n: p.data.clone() for n, p in base_stack.named_parameters()}
+    base_ddp, base_optim = _build_ddp_distopt_and_optim(base_stack)
+    base_gn = _run_step_distopt(base_ddp, base_optim, rank)
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # ---------- Phase B: GTP2/EGTP2, EP2 (EGTP actually shards experts) ----------
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+        expert_model_parallel_size=2,
+        gtp_remat_size=2,
+        expert_gtp_remat_size=2,
+    )
+    model_parallel_cuda_manual_seed(42)
+    pgc = ProcessGroupCollection.use_mpu_process_groups(required_pgs=pgs)
+    moe_stack = _make_moe_stack(_make_moe_config(), pgc)
+    for layer in moe_stack:
+        layer.cuda()
+    g = ps.get_gtp_weight_remat_group()
+    eg = ps.get_expert_gtp_weight_remat_group()
+    gtp_rank, egtp_rank = g.rank(), eg.rank()
+    n_egtp_sharded = 0
+    for name, p in moe_stack.named_parameters():
+        full = saved[name]  # EP2 layout identical to baseline -> rank-local match
+        if isinstance(p, GTPShardedParam):
+            # dense GTP shards over the gtp group; expert (EGTP) shards over the egtp group.
+            is_expert = _is_expert_param(name, p)
+            r = egtp_rank if is_expert else gtp_rank
+            ss = p.shape[0]
+            p.data.copy_(full[r * ss : (r + 1) * ss])
+            if is_expert:
+                n_egtp_sharded += 1
+        else:
+            p.data.copy_(full)
+    if rank == 0:
+        print(
+            f"[moe-egtp] egtp-sharded expert params = {n_egtp_sharded} (must be >0 to be a "
+            f"faithful EGTP test)",
+            flush=True,
+        )
+    moe_ddp, moe_optim = _build_ddp_distopt_and_optim(moe_stack)
+    moe_gn = _run_step_distopt(moe_ddp, moe_optim, rank)
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    if rank == 0:
+        ratio = moe_gn / max(base_gn, 1e-12)
+        print(
+            f"\n[moe distopt grad-norm] baseline={base_gn:.6f}  GTP={moe_gn:.6f}  "
+            f"ratio={ratio:.4f}",
+            flush=True,
+        )
+        torch.testing.assert_close(torch.tensor(moe_gn), torch.tensor(base_gn), atol=0, rtol=3e-2)
+
+
+def _worker_idog_span(rank, world_size, port):
+    """Dist-opt grad-stats group (intra_dist_opt) must span the FULL world for both
+    dense-only and MoE(EP2/EGTP2) configs. A naive build collapses the MoE case to a sub-world
+    group (egtp factored out of expert_data_parallel_size), under-counting the grad-norm."""
+    from megatron.core import parallel_state as ps
+
+    # MoE EP2 EGTP2 GTP2 (the a55b-shaped expert config).
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+        expert_model_parallel_size=2,
+        gtp_remat_size=2,
+        expert_gtp_remat_size=2,
+    )
+    moe_idog = ps.get_intra_distributed_optimizer_instance_group().size()
+    ps.destroy_model_parallel()
+    # Dense-only GTP2 (must remain world too).
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+    )
+    dense_idog = ps.get_intra_distributed_optimizer_instance_group().size()
+    ps.destroy_model_parallel()
+    if rank == 0:
+        print(
+            f"[idog] MoE intra_dist_opt.size={moe_idog} dense.size={dense_idog} "
+            f"(world={world_size})",
+            flush=True,
+        )
+        assert moe_idog == world_size, (
+            f"MoE grad-stats group = {moe_idog}, expected world {world_size} "
+            f"-> grad-norm would under-count gtp/egtp-sharded params"
+        )
+        assert dense_idog == world_size, f"dense grad-stats group = {dense_idog}"
+
+
+class TestGTPGradCorrectness:
+    def test_distopt_gradstats_group_spans_world(self):
+        """intra_dist_opt_group (grad-stats) must span the full world."""
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires 4 CUDA devices")
+        _run_distributed(_worker_idog_span, 4)
+
+    def test_gtp2_dp2_grad_matches_dp4_baseline(self):
+        """GTP2xDP2 reduced grad must match no-GTP DP4 (non-dist-opt main_grad)."""
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires 4 CUDA devices")
+        _run_distributed(_worker, 4)
+
+    def test_gtp2_dp2_distopt_grad_norm_matches_dp4_baseline(self):
+        """GTP2xDP2 dist-opt grad-norm must match no-GTP DP4 (the 64-GPU path)."""
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires 4 CUDA devices")
+        _run_distributed(_worker_distopt, 4)
+
+    @pytest.mark.skip(
+        reason="EP=2 (engages EGTP) but the minimal test dims (SEQ16 BATCH1 hidden256) hit a "
+        "token-dispatcher shape error in the alltoall path (RuntimeError shape [2,1,4]). Needs a "
+        "larger MoE config to run; left as a stub. The real EGTP path is validated by the a55b "
+        "re-run (loss matches the GTP1/EGTP1 baseline after the is_gtp/allreduce master-param fix)."
+    )
+    def test_moe_egtp_distopt_grad_norm_matches_baseline(self):
+        """GTP2/EGTP2 MoE dist-opt grad-norm must match GTP1/EGTP1 baseline (EP=2 both)."""
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires 4 CUDA devices")
+        _run_distributed(_worker_moe_distopt, 4)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_gtp_loss_correctness.py b/tests/unit_tests/generalized_tensor_parallel/test_gtp_loss_correctness.py
new file mode 100644
index 00000000000..1d1630dc585
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_gtp_loss_correctness.py
@@ -0,0 +1,197 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Integration test for GTP correctness.
+
+Validates that GTP run as a first-class parallelism axis
+(world_size = TP * GTP * CP * DP) produces the same per-step loss as a no-GTP
+baseline. This is the end-to-end proof that the standalone-GTP rank grid built
+in parallel_state trains correctly.
+
+Mirrors TestAttentionGTPCorrectness. With world=4 and gtp_remat_size=4, GTP
+yields dp_replicate=1 and a single shard group [0,1,2,3], so the loss must match
+the GTP=1 baseline.
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+from transformer_engine.pytorch import fp8_autocast
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (  # noqa: F401  (autouse, module-scoped: initializes the dist PG); noqa: F401  (autouse)
+    _requires_mxfp8,
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+
+def _worker_gtp_loss_correctness(rank, world_size, port):
+    """Baseline (GTP=1, DP=4) vs GTP=4 (world=TP1*GTP4*CP1*DP1)."""
+    from transformer_engine.common.recipe import MXFP8BlockScaling
+    from transformer_engine.pytorch.quantization import FP8GlobalStateManager
+
+    from megatron.core import parallel_state as ps
+    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    HIDDEN = 4096
+    NUM_HEADS = 32
+    FFN_HIDDEN = 16384
+    NUM_LAYERS = 2
+    SEQ = 32
+    BATCH = 1
+    LR = 0.01
+    STEPS = 10
+    dtype = torch.bfloat16
+    recipe = MXFP8BlockScaling()
+
+    def make_config():
+        return TransformerConfig(
+            num_attention_heads=NUM_HEADS,
+            num_layers=NUM_LAYERS,
+            hidden_size=HIDDEN,
+            ffn_hidden_size=FFN_HIDDEN,
+            add_bias_linear=False,
+            params_dtype=dtype,
+            hidden_dropout=0.0,
+            attention_dropout=0.0,
+            bias_dropout_fusion=False,
+            fp8='e4m3',
+            tensor_model_parallel_size=1,
+            pipeline_model_parallel_size=1,
+        )
+
+    def make_transformer_stack(config, pg_collection):
+        spec = get_gpt_layer_with_transformer_engine_spec()
+        return torch.nn.ModuleList(
+            [
+                spec.module(
+                    config, spec.submodules, layer_number=i + 1, pg_collection=pg_collection
+                )
+                for i in range(NUM_LAYERS)
+            ]
+        )
+
+    def run_step(layers, x):
+        with fp8_autocast(enabled=True, fp8_recipe=recipe):
+            for layer in layers:
+                x, _ = layer(x, attention_mask=None)
+        return x.mean()
+
+    # ---- Phase 1: Baseline — GTP=1 (DP=4) ----
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=1
+    )
+    model_parallel_cuda_manual_seed(42)
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = make_config()
+    layers = make_transformer_stack(config, pg_collection)
+    for layer in layers:
+        layer.cuda()
+    for p in layers.parameters():
+        dist.broadcast(p.data, src=0)
+    saved_weights = {n: p.data.clone() for n, p in layers.named_parameters()}
+
+    baseline_losses = []
+    for step in range(STEPS):
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+        loss = run_step(layers, x)
+        if rank == 0:
+            baseline_losses.append(loss.item())
+        loss.backward()
+        with torch.no_grad():
+            for p in layers.parameters():
+                if p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+    FP8GlobalStateManager.reset()
+
+    # ---- Phase 2: GTP=4 (world = TP1 * GTP4 * CP1 * DP1) ----
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+        gtp_remat_size=4,  # standalone-axis GTP under test
+    )
+    model_parallel_cuda_manual_seed(42)
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = make_config()
+    layers_gtp = make_transformer_stack(config, pg_collection)
+    for layer in layers_gtp:
+        layer.cuda()
+
+    gtp_group = ps.get_gtp_weight_remat_group()
+    gtp_size = gtp_group.size()
+    gtp_rank = gtp_group.rank()
+    assert gtp_size == 4, f"GTP shard group size should be 4, got {gtp_size}"
+
+    gtp_params = [p for p in layers_gtp.parameters() if isinstance(p, GTPShardedParam)]
+    assert len(gtp_params) > 0, "GTP not active: no GTPShardedParam found"
+
+    for name, p in layers_gtp.named_parameters():
+        full = saved_weights[name]
+        if isinstance(p, GTPShardedParam):
+            shard_size = p.shape[0]
+            p.data.copy_(full[gtp_rank * shard_size : (gtp_rank + 1) * shard_size])
+        else:
+            p.data.copy_(full)
+
+    for p in layers_gtp.parameters():
+        if isinstance(p, GTPShardedParam):
+            p.main_grad = torch.zeros(p.shape, dtype=dtype, device='cuda')
+
+    gtp_losses = []
+    for step in range(STEPS):
+        for p in layers_gtp.parameters():
+            if isinstance(p, GTPShardedParam):
+                p.main_grad.zero_()
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+        loss = run_step(layers_gtp, x)
+        if rank == 0:
+            gtp_losses.append(loss.item())
+        loss.backward()
+        with torch.no_grad():
+            for p in layers_gtp.parameters():
+                if isinstance(p, GTPShardedParam):
+                    p.data.sub_((LR / gtp_size) * p.main_grad)
+                elif p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    if rank == 0:
+        assert len(baseline_losses) == STEPS and len(gtp_losses) == STEPS
+        for step, (lb, lg) in enumerate(zip(baseline_losses, gtp_losses)):
+            print(f"Step {step:2d}: baseline={lb:.6f}  orth_gtp={lg:.6f}", flush=True)
+        torch.testing.assert_close(
+            torch.tensor(gtp_losses), torch.tensor(baseline_losses), atol=1e-5, rtol=1e-5
+        )
+
+
+class TestGTPLossCorrectness:
+    def test_gtp_loss_trajectory_matches_baseline(self):
+        """GTP=4 per-step losses must match no-GTP baseline (atol=1e-5, rtol=1e-5)."""
+        _requires_mxfp8()
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires at least 4 CUDA devices")
+        _run_distributed(_worker_gtp_loss_correctness, 4)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_gtp_muon_dcp.py b/tests/unit_tests/generalized_tensor_parallel/test_gtp_muon_dcp.py
new file mode 100644
index 00000000000..6b8fb1fe51c
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_gtp_muon_dcp.py
@@ -0,0 +1,135 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Unit tests for GTP + Muon (LayerWise) distributed checkpointing.
+
+Covers the optimizer-state checkpoint roundtrip for the
+:class:`LayerWiseDistributedOptimizer` (Muon) under GTP, where GTP-replicated
+matrix params (e.g. the MoE router) are kept whole and must be disambiguated
+by ``replica_id`` so DCP does not see multiple writers for the same shard.
+"""
+
+import torch
+
+from megatron.core.dist_checkpointing import load, save
+from tests.unit_tests.dist_checkpointing import TempNamedDir, setup_model_and_optimizer
+from tests.unit_tests.test_utilities import Utils
+
+
+def check_equal(input_1, input_2):
+    """Check if two inputs are equal, used for checking checkpointing."""
+    if isinstance(input_1, dict) and isinstance(input_2, dict):
+        assert input_1.keys() == input_2.keys()
+        for key in input_1.keys():
+            check_equal(input_1[key], input_2[key])
+    elif isinstance(input_1, list) and isinstance(input_2, list):
+        assert len(input_1) == len(input_2)
+        for i in range(len(input_1)):
+            check_equal(input_1[i], input_2[i])
+    elif isinstance(input_1, torch.Tensor) and isinstance(input_2, torch.Tensor):
+        assert torch.all(input_1 == input_2), f"Input 1: {input_1} != Input 2: {input_2}"
+    elif type(input_1) != type(input_2):
+        assert False, f"Input 1 type: {type(input_1)} != Input 2 type: {type(input_2)}"
+    else:
+        assert input_1 == input_2, f"Input 1: {input_1} != Input 2: {input_2}"
+
+
+class TestGTPMuonDCP:
+    """GTP + Muon (LayerWise) distributed checkpointing tests."""
+
+    def teardown_method(self, method):
+        Utils.destroy_model_parallel()
+
+    def test_gtp_muon_moe_save_load(self, tmp_path_dist_ckpt):
+        """GTP + Muon (LayerWise) optimizer-state checkpoint roundtrip.
+
+        GTP-REPLICATED, Muon-managed matrix params (e.g. the MoE router, held identically on every
+        GTP peer) must not collide on GTP peers during checkpoint save: LayerWise keeps each such
+        param whole, so its optimizer-state ShardedTensor has the same key+offset on all GTP peers
+        and the replica_id must distinguish them, or DCP validate_sharding_integrity reports 2
+        writers ('Invalid access pattern ... [[2]]'). Adam dodges this by sharding the state.
+        """
+        import os
+        from functools import partial
+
+        import pytest
+
+        from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+        if not HAVE_GTP:
+            pytest.skip("GTP requires TE with hook registry")
+        if int(os.environ.get('WORLD_SIZE', '1')) != 4:
+            pytest.skip("Requires world_size 4 (gtp2 x dp2)")
+
+        os.environ['MEGATRON_GTP_FORCE_ENABLE'] = '1'
+        from megatron.core import parallel_state as ps
+        from megatron.core.tensor_parallel import model_parallel_cuda_manual_seed
+        from megatron.core.tensor_parallel.gtp import GTP_CONFIG, GTPShardedParam, update_gtp_config
+        from tests.unit_tests.dist_checkpointing.utils import initialize_moe_model
+
+        Utils.initialize_model_parallel(1, 1)  # bootstrap torch.distributed + model parallel
+        ps.destroy_model_parallel()
+        ps.initialize_model_parallel(
+            tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=2
+        )
+        model_parallel_cuda_manual_seed(2)
+        # Disable GTP alignment padding so the tiny test dims slice cleanly by gtp_size.
+        _orig_pad = GTP_CONFIG.pad_for_alignment
+        update_gtp_config(pad_for_alignment=0)
+        # GTP-friendly dims (divisible by gtp_size=2); GPU init (CPU affine init is not GTP-aware
+        # for the strided QKV weight).
+        moe_cfg = dict(
+            hidden_size=64,
+            num_attention_heads=8,
+            kv_channels=8,
+            ffn_hidden_size=128,
+            use_cpu_initialization=False,
+        )
+        meta = {'distrib_optim_sharding_type': 'dp_reshardable'}
+        with TempNamedDir(tmp_path_dist_ckpt / 'gtp_muon_moe_A', sync=True) as ckpt_dir_A:
+            with TempNamedDir(tmp_path_dist_ckpt / 'gtp_muon_moe_B', sync=True) as ckpt_dir_B:
+                model_A, optimizer_A = setup_model_and_optimizer(
+                    seed=2,
+                    tp=1,
+                    pp=1,
+                    bf16=True,
+                    dist_opt=True,
+                    use_param_layout=True,
+                    initialize_fn=partial(initialize_moe_model, use_te=True, **moe_cfg),
+                    optimizer='dist_muon',
+                )
+                assert any(
+                    isinstance(p, GTPShardedParam) for p in model_A[0].parameters()
+                ), "GTP not active: no GTPShardedParam in the GTP=2 MoE model"
+
+                model_sd_A = model_A[0].sharded_state_dict()
+                optim_sd_A = optimizer_A.sharded_state_dict(model_sd_A, metadata=meta)
+                save(
+                    optim_sd_A, ckpt_dir_A
+                )  # fails (2 writers) before the LayerWise replica_id fix
+
+                model_B, optimizer_B = setup_model_and_optimizer(
+                    seed=3,
+                    tp=1,
+                    pp=1,
+                    bf16=True,
+                    dist_opt=True,
+                    use_param_layout=True,
+                    initialize_fn=partial(initialize_moe_model, use_te=True, **moe_cfg),
+                    optimizer='dist_muon',
+                )
+                model_sd_B = model_B[0].sharded_state_dict()
+                load_sharded_sd = optimizer_B.sharded_state_dict(
+                    model_sd_B, is_loading=True, metadata=meta
+                )
+                state_dict = load(load_sharded_sd, ckpt_dir_A)
+                optimizer_B.load_state_dict(state_dict)
+                optim_sd_B = optimizer_B.sharded_state_dict(model_sd_B, metadata=meta)
+                save(optim_sd_B, ckpt_dir_B)
+
+                update_gtp_config(pad_for_alignment=_orig_pad)
+
+                Utils.destroy_model_parallel()
+                Utils.initialize_model_parallel(1, 1)
+                from megatron.core.dist_checkpointing import load_plain_tensors
+
+                check_equal(load_plain_tensors(ckpt_dir_A), load_plain_tensors(ckpt_dir_B))
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_mamba_gtp.py b/tests/unit_tests/generalized_tensor_parallel/test_mamba_gtp.py
new file mode 100644
index 00000000000..5d81c868722
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_mamba_gtp.py
@@ -0,0 +1,260 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Integration tests for GTP + Mamba correctness.
+
+Test groups
+-----------
+TestMambaGTPCorrectness  - GTP Mamba loss trajectory matches baseline (no-GTP) over 10
+                           training steps using MXFP8 and Nemotron3-Super Mamba hyperparameters.
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+from transformer_engine.pytorch import fp8_autocast
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (
+    _requires_mxfp8,
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+# ---------------------------------------------------------------------------
+# Mamba GTP correctness: per-step loss trajectory baseline vs GTP=4
+# ---------------------------------------------------------------------------
+
+
+def _worker_mamba_gtp_correctness(rank, world_size, port):
+    """Verify GTP Mamba produces the same per-step loss as a no-GTP baseline.
+
+    Phase 1 — GTP=1, DP=4:
+        All 4 ranks hold the full model and process identical inputs.  Gradients
+        are identical across ranks (no all-reduce needed).  Weight update:
+            param.data -= lr * param.grad
+
+    Phase 2 — GTP=4, DP=1:
+        Weights sharded across 4 ranks.  After backward, wgrad reduce-scatter
+        sums each shard's identical wgrad over all ranks, so:
+            main_grad[rank_i] = gtp_size * dW[shard_i]
+        The optimizer divides by gtp_size to recover the per-element gradient:
+            param.data -= (lr / gtp_size) * param.main_grad
+
+    Both phases use identical initial weights (synced from rank 0 in phase 1,
+    restored as shards in phase 2) and identical step-by-step inputs.  The
+    per-step loss trajectories must agree within 0.1% relative error.
+    """
+    from transformer_engine.common.recipe import MXFP8BlockScaling
+    from transformer_engine.pytorch.quantization import FP8GlobalStateManager
+
+    from megatron.core import parallel_state as ps
+    from megatron.core.extensions.transformer_engine import (
+        TELayerNormColumnParallelLinear,
+        TERowParallelLinear,
+    )
+    from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.ssm.mamba_layer import MambaLayer, MambaLayerSubmodules
+    from megatron.core.ssm.mamba_mixer import MambaMixer, MambaMixerSubmodules
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+    from megatron.core.transformer.spec_utils import ModuleSpec
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    # Nemotron3-Super Proxy Mamba hyperparameters.
+    # in_proj_out = 2*8192 + 2*8*128 + 128 = 18560; 18560/4 = 4640, 4640%16 = 0 (MXFP8-aligned).
+    HIDDEN = 4096
+    NHEADS = 128  # mamba_num_heads; d_inner = nheads * headdim = 128 * 64 = 8192
+    NGROUPS = 8  # mamba_num_groups (default)
+    D_STATE = 128  # mamba_state_dim (default)
+    NUM_LAYERS = 2
+    SEQ = 32
+    BATCH = 1
+    LR = 0.01
+    STEPS = 10
+    dtype = torch.bfloat16
+    recipe = MXFP8BlockScaling()
+
+    def make_config():
+        return TransformerConfig(
+            num_attention_heads=32,
+            num_layers=NUM_LAYERS,
+            hidden_size=HIDDEN,
+            mamba_num_heads=NHEADS,
+            mamba_head_dim=64,
+            mamba_state_dim=D_STATE,
+            mamba_num_groups=NGROUPS,
+            use_mamba_mem_eff_path=True,
+            params_dtype=dtype,
+            hidden_dropout=0.0,
+            bias_dropout_fusion=False,
+            fp8='e4m3',
+            tensor_model_parallel_size=1,
+            pipeline_model_parallel_size=1,
+        )
+
+    def make_mamba_stack(config, pg_collection):
+        submodules = MambaLayerSubmodules(
+            mixer=ModuleSpec(
+                module=MambaMixer,
+                submodules=MambaMixerSubmodules(
+                    in_proj=TELayerNormColumnParallelLinear, out_proj=TERowParallelLinear
+                ),
+            ),
+            mamba_bda=get_bias_dropout_add,
+        )
+        return torch.nn.ModuleList(
+            [
+                MambaLayer(config, submodules, layer_number=i + 1, pg_collection=pg_collection)
+                for i in range(NUM_LAYERS)
+            ]
+        )
+
+    def run_step(layers, x):
+        with fp8_autocast(enabled=True, fp8_recipe=recipe):
+            for layer in layers:
+                x = layer(x)
+        return x.mean()
+
+    # -------------------------------------------------------------------------
+    # Phase 1: Baseline — GTP=1 (DP=4)
+    # -------------------------------------------------------------------------
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=1
+    )
+    model_parallel_cuda_manual_seed(42)
+
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = make_config()
+    layers = make_mamba_stack(config, pg_collection)
+    for layer in layers:
+        layer.cuda()
+
+    # Verify baseline has no GTP sharding (gtp_remat_size=1 should leave plain parameters).
+    assert not any(
+        isinstance(p, GTPShardedParam) for p in layers.parameters()
+    ), "Baseline GTP=1 stack should have no GTPShardedParam"
+
+    # Synchronize weights from rank 0 across all DP ranks.
+    for p in layers.parameters():
+        dist.broadcast(p.data, src=0)
+
+    # Save initial weights; will be used to initialize the GTP model identically.
+    saved_weights = {n: p.data.clone() for n, p in layers.named_parameters()}
+
+    baseline_losses = []
+    for step in range(STEPS):
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+
+        loss = run_step(layers, x)
+        if rank == 0:
+            baseline_losses.append(loss.item())
+
+        loss.backward()
+        with torch.no_grad():
+            for p in layers.parameters():
+                if p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+    FP8GlobalStateManager.reset()
+
+    # -------------------------------------------------------------------------
+    # Phase 2: GTP=4 (DP=1)
+    # -------------------------------------------------------------------------
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1, pipeline_model_parallel_size=1, gtp_remat_size=4
+    )
+    model_parallel_cuda_manual_seed(42)
+
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['tp', 'cp', 'gtp'])
+    config = make_config()
+    layers_gtp = make_mamba_stack(config, pg_collection)
+    for layer in layers_gtp:
+        layer.cuda()
+
+    gtp_group = ps.get_gtp_weight_remat_group()
+    gtp_size = gtp_group.size()
+    gtp_rank = gtp_group.rank()
+
+    # Verify GTP is truly active: at least one param must be a GTPShardedParam.
+    gtp_params = [p for p in layers_gtp.parameters() if isinstance(p, GTPShardedParam)]
+    assert len(gtp_params) > 0, "GTP is not active: no GTPShardedParam found in GTP=4 Mamba stack"
+
+    # Restore initial weights: GTP params get the matching shard, others get the full tensor.
+    for name, p in layers_gtp.named_parameters():
+        full = saved_weights[name]
+        if isinstance(p, GTPShardedParam):
+            shard_size = p.shape[0]
+            p.data.copy_(full[gtp_rank * shard_size : (gtp_rank + 1) * shard_size])
+        else:
+            p.data.copy_(full)
+
+    # Pre-allocate main_grad for GTP params (required before the first backward).
+    for p in layers_gtp.parameters():
+        if isinstance(p, GTPShardedParam):
+            p.main_grad = torch.zeros(p.shape, dtype=dtype, device='cuda')
+
+    gtp_losses = []
+    for step in range(STEPS):
+        for p in layers_gtp.parameters():
+            if isinstance(p, GTPShardedParam):
+                p.main_grad.zero_()
+
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+
+        loss = run_step(layers_gtp, x)
+        if rank == 0:
+            gtp_losses.append(loss.item())
+
+        loss.backward()
+
+        # After RS, main_grad = gtp_size * dW_shard (sum over ranks, all ranks hold the same
+        # full wgrad after all-gathering the weight in fwd).  Divide by gtp_size so the weight
+        # update is equivalent to the baseline.
+        with torch.no_grad():
+            for p in layers_gtp.parameters():
+                if isinstance(p, GTPShardedParam):
+                    p.data.sub_((LR / gtp_size) * p.main_grad)
+                elif p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # -------------------------------------------------------------------------
+    # Compare per-step loss trajectories on rank 0
+    # -------------------------------------------------------------------------
+    if rank == 0:
+        assert len(baseline_losses) == STEPS
+        assert len(gtp_losses) == STEPS
+        for step, (lb, lg) in enumerate(zip(baseline_losses, gtp_losses)):
+            print(f"Step {step:2d}: baseline={lb:.6f}  gtp={lg:.6f}", flush=True)
+        torch.testing.assert_close(
+            torch.tensor(gtp_losses), torch.tensor(baseline_losses), atol=1e-5, rtol=1e-5
+        )
+
+
+class TestMambaGTPCorrectness:
+    def test_mamba_gtp_loss_trajectory_matches_baseline(self):
+        """GTP Mamba per-step losses must match no-GTP baseline (atol=1e-5, rtol=1e-5; MXFP8, Nemotron3-Super)."""
+        _requires_mxfp8()
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires at least 4 CUDA devices")
+        _run_distributed(_worker_mamba_gtp_correctness, 4)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_moe_egtp.py b/tests/unit_tests/generalized_tensor_parallel/test_moe_egtp.py
new file mode 100644
index 00000000000..d949cabcdd2
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_moe_egtp.py
@@ -0,0 +1,291 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Integration tests for EGTP + MoE correctness.
+
+Test groups
+-----------
+TestMoEEGTPCorrectness  - EGTP MoE loss trajectory matches baseline (no-EGTP) over 10
+                          training steps using MXFP8 and Nemotron3-Super MoE hyperparameters.
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+from transformer_engine.pytorch import fp8_autocast
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from megatron.core.transformer.moe.moe_utils import get_default_pg_collection
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (
+    _requires_mxfp8,
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+# ---------------------------------------------------------------------------
+# MoE EGTP correctness: per-step loss trajectory EP=4 baseline vs EP=2+EGTP=2
+# ---------------------------------------------------------------------------
+
+
+def _worker_moe_egtp_correctness(rank, world_size, port):
+    """Verify EP=2+EGTP=2 MoE produces the same per-step loss as an EP=4 no-EGTP baseline.
+
+    Phase 1 — EP=4, EGTP=1:
+        All 4 ranks form one EP group; each rank holds 2 full expert weights (8 total).
+        All ranks receive the same MoE-layer input; alltoall dispatch routes each token
+        to its assigned expert rank, so each rank computes a different token subset.
+        Gradients are local to each expert's rank.  Weight update:
+            param.data -= lr * param.grad
+
+    Phase 2 — EP=2, EGTP=2:
+        Two EP groups of 2 ranks, each EGTP-sharded over 2 ranks.  Expert weights
+        are sharded along dim 0 within each EGTP group (shard = full_dim0 / egtp_size).
+        After backward, wgrad reduce-scatter sums each shard's identical wgrad:
+            main_grad[rank_i] = egtp_size * dW[shard_i]
+        The optimizer divides by egtp_size:
+            param.data -= (lr / egtp_size) * param.main_grad
+
+    Weight sharing (test-only):
+        To ensure both phases start from identical expert weights, an all-gather
+        collects the full 8-expert table from the EP=4 group (where each rank holds
+        only 2 experts) onto every rank.  Phase 2 then slices each rank's local
+        experts and EGTP shard from that global table.
+
+    Nemotron3-Super Proxy MoE hyperparameters (scaled for unit-test speed):
+        hidden=4096, ffn_hidden_size=2688, num_experts=8, topk=2
+    MXFP8 alignment with EGTP=2:
+        2688/2=1344, 1344%16=0 (fc1 shard); 4096/2=2048, 2048%16=0 (fc2 shard)
+    """
+    from transformer_engine.common.recipe import MXFP8BlockScaling
+    from transformer_engine.pytorch.quantization import FP8GlobalStateManager
+
+    from megatron.core import parallel_state as ps
+    from megatron.core.models.gpt.moe_module_specs import get_moe_module_spec
+    from megatron.core.process_groups_config import ProcessGroupCollection
+    from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
+    from megatron.core.transformer.transformer_config import TransformerConfig
+
+    # Nemotron3-Super MoE hyperparameters (num_experts scaled from 512 to 8 for test speed).
+    HIDDEN = 4096
+    FFN_HIDDEN = 2688
+    NUM_EXPERTS = 8
+    TOPK = 2
+    SEQ = 32
+    BATCH = 1
+    LR = 0.01
+    STEPS = 10
+    dtype = torch.bfloat16
+    recipe = MXFP8BlockScaling()
+
+    def make_config():
+        return TransformerConfig(
+            num_attention_heads=32,
+            num_layers=1,
+            hidden_size=HIDDEN,
+            num_moe_experts=NUM_EXPERTS,
+            moe_router_topk=TOPK,
+            moe_ffn_hidden_size=FFN_HIDDEN,
+            moe_grouped_gemm=True,
+            moe_token_dispatcher_type="alltoall",
+            moe_aux_loss_coeff=0.0,
+            add_bias_linear=False,
+            params_dtype=dtype,
+            hidden_dropout=0.0,
+            bias_dropout_fusion=False,
+            fp8='e4m3',
+            tensor_model_parallel_size=1,
+            pipeline_model_parallel_size=1,
+        )
+
+    def make_moe_layer(config, pg_collection):
+        moe_spec = get_moe_module_spec(use_te=True, num_experts=NUM_EXPERTS, moe_grouped_gemm=True)
+        return moe_spec(config, layer_number=1, pg_collection=pg_collection)
+
+    def run_step(layer, x):
+        with fp8_autocast(enabled=True, fp8_recipe=recipe):
+            output, _ = layer(x)
+        return output.mean()
+
+    # -------------------------------------------------------------------------
+    # Phase 1: Baseline — EP=4, EGTP=1 (DP=1)
+    # -------------------------------------------------------------------------
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+        expert_model_parallel_size=4,
+        expert_gtp_remat_size=1,
+    )
+    model_parallel_cuda_manual_seed(42)
+
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['ep'])
+    ep_group = pg_collection.ep
+    num_local_experts_baseline = NUM_EXPERTS // 4  # = 2
+
+    config = make_config()
+    layer = make_moe_layer(config, None)  # MoELayer uses get_default_pg_collection()
+    layer.cuda()
+
+    # Verify baseline has no GTP sharding (EGTP=1 should leave plain parameters).
+    assert not any(
+        isinstance(p, GTPShardedParam) for p in layer.parameters()
+    ), "Baseline EP=4 layer should have no GTPShardedParam (EGTP=1)"
+
+    # Synchronize non-expert weights from rank 0; expert weights are rank-local.
+    for name, p in layer.named_parameters():
+        if 'linear_fc1.weight' not in name and 'linear_fc2.weight' not in name:
+            dist.broadcast(p.data, src=0)
+
+    # Collect the full expert weight table so Phase 2 can restore identical init weights.
+    # EP=4: each rank holds 2 experts; all-gather gives every rank the complete [8, dim, ...] table.
+    local_fc1 = torch.stack(
+        [
+            dict(layer.named_parameters())[f'experts.linear_fc1.weight{i}'].data
+            for i in range(num_local_experts_baseline)
+        ]
+    )  # [2, FFN_HIDDEN, HIDDEN]
+    global_fc1 = torch.zeros(NUM_EXPERTS, FFN_HIDDEN, HIDDEN, dtype=dtype, device='cuda')
+    dist.all_gather_into_tensor(global_fc1, local_fc1, group=ep_group)
+
+    local_fc2 = torch.stack(
+        [
+            dict(layer.named_parameters())[f'experts.linear_fc2.weight{i}'].data
+            for i in range(num_local_experts_baseline)
+        ]
+    )  # [2, HIDDEN, FFN_HIDDEN]
+    global_fc2 = torch.zeros(NUM_EXPERTS, HIDDEN, FFN_HIDDEN, dtype=dtype, device='cuda')
+    dist.all_gather_into_tensor(global_fc2, local_fc2, group=ep_group)
+
+    # Save non-expert param values (router, norms, etc.) from rank 0.
+    non_expert_weights = {}
+    for name, p in layer.named_parameters():
+        if 'linear_fc1.weight' not in name and 'linear_fc2.weight' not in name:
+            non_expert_weights[name] = p.data.clone()
+
+    baseline_losses = []
+    for step in range(STEPS):
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+
+        loss = run_step(layer, x)
+        if rank == 0:
+            baseline_losses.append(loss.item())
+
+        loss.backward()
+        with torch.no_grad():
+            for p in layer.parameters():
+                if p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    GTPShardedParam._chain_state = {}
+    FP8GlobalStateManager.reset()
+
+    # -------------------------------------------------------------------------
+    # Phase 2: EP=2, EGTP=2 (DP=1 effective)
+    # -------------------------------------------------------------------------
+    ps.initialize_model_parallel(
+        tensor_model_parallel_size=1,
+        pipeline_model_parallel_size=1,
+        expert_model_parallel_size=2,
+        expert_gtp_remat_size=2,
+    )
+    model_parallel_cuda_manual_seed(42)
+
+    pg_collection = ProcessGroupCollection.use_mpu_process_groups(required_pgs=['expt_gtp'])
+    egtp_group = pg_collection.expt_gtp
+    egtp_size = egtp_group.size()
+    egtp_rank = egtp_group.rank()
+    ep_rank_egtp = dist.get_rank(ps.get_expert_model_parallel_group())
+    num_local_experts_egtp = NUM_EXPERTS // 2  # = 4
+
+    config = make_config()
+    # Build full pg_collection for MoELayer: default groups + expt_gtp for EGTP sharding.
+    moe_pg = get_default_pg_collection()
+    moe_pg.expt_gtp = egtp_group
+    layer_egtp = make_moe_layer(config, moe_pg)
+    layer_egtp.cuda()
+
+    # Verify EGTP is truly active: expert weight params must be GTPShardedParam instances.
+    egtp_params = [p for p in layer_egtp.parameters() if isinstance(p, GTPShardedParam)]
+    assert len(egtp_params) > 0, "EGTP is not active: no GTPShardedParam found in EP=2+EGTP=2 layer"
+
+    # Restore weights from saved global tables.
+    # Expert local index j → global expert id = ep_rank_egtp * num_local_experts_egtp + j.
+    fc1_shard = FFN_HIDDEN // egtp_size  # 2688/2 = 1344
+    fc2_shard = HIDDEN // egtp_size  # 4096/2 = 2048
+    for name, p in layer_egtp.named_parameters():
+        if 'linear_fc1.weight' in name:
+            j = int(name.rsplit('weight', 1)[1])
+            gid = ep_rank_egtp * num_local_experts_egtp + j
+            p.data.copy_(global_fc1[gid, egtp_rank * fc1_shard : (egtp_rank + 1) * fc1_shard])
+        elif 'linear_fc2.weight' in name:
+            j = int(name.rsplit('weight', 1)[1])
+            gid = ep_rank_egtp * num_local_experts_egtp + j
+            p.data.copy_(global_fc2[gid, egtp_rank * fc2_shard : (egtp_rank + 1) * fc2_shard])
+        elif name in non_expert_weights:
+            p.data.copy_(non_expert_weights[name])
+
+    # Pre-allocate main_grad for EGTP params (required before the first backward).
+    for p in layer_egtp.parameters():
+        if isinstance(p, GTPShardedParam):
+            p.main_grad = torch.zeros(p.shape, dtype=dtype, device='cuda')
+
+    egtp_losses = []
+    for step in range(STEPS):
+        for p in layer_egtp.parameters():
+            if isinstance(p, GTPShardedParam):
+                p.main_grad.zero_()
+
+        torch.manual_seed(step)
+        x = torch.randn(SEQ, BATCH, HIDDEN, dtype=dtype, device='cuda')
+        dist.broadcast(x, src=0)
+
+        loss = run_step(layer_egtp, x)
+        if rank == 0:
+            egtp_losses.append(loss.item())
+
+        loss.backward()
+
+        # After RS, main_grad = egtp_size * dW_shard.  Divide by egtp_size to match baseline.
+        with torch.no_grad():
+            for p in layer_egtp.parameters():
+                if isinstance(p, GTPShardedParam):
+                    p.data.sub_((LR / egtp_size) * p.main_grad)
+                elif p.grad is not None:
+                    p.data.sub_(LR * p.grad)
+                    p.grad.zero_()
+
+    ps.destroy_model_parallel()
+    ps.initialize_model_parallel()
+    GTPShardedParam._chain_state = {}
+
+    # -------------------------------------------------------------------------
+    # Compare per-step loss trajectories on rank 0
+    # -------------------------------------------------------------------------
+    if rank == 0:
+        assert len(baseline_losses) == STEPS
+        assert len(egtp_losses) == STEPS
+        for step, (lb, le) in enumerate(zip(baseline_losses, egtp_losses)):
+            print(f"Step {step:2d}: baseline={lb:.6f}  egtp={le:.6f}", flush=True)
+        torch.testing.assert_close(
+            torch.tensor(egtp_losses), torch.tensor(baseline_losses), atol=1e-5, rtol=1e-5
+        )
+
+
+class TestMoEEGTPCorrectness:
+    def test_moe_egtp_loss_trajectory_matches_baseline(self):
+        """EP=2+EGTP=2 MoE per-step losses must match EP=4 baseline: atol=1e-5, rtol=1e-5; MXFP8"""
+        _requires_mxfp8()
+        if torch.cuda.device_count() < 4:
+            pytest.skip("Requires at least 4 CUDA devices")
+        _run_distributed(_worker_moe_egtp_correctness, 4)
diff --git a/tests/unit_tests/generalized_tensor_parallel/test_tp_gtp.py b/tests/unit_tests/generalized_tensor_parallel/test_tp_gtp.py
new file mode 100644
index 00000000000..2c206d63be6
--- /dev/null
+++ b/tests/unit_tests/generalized_tensor_parallel/test_tp_gtp.py
@@ -0,0 +1,325 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+"""Unit tests for combined Tensor Parallelism + Generalized Tensor Parallelism (TP+GTP).
+
+Process group layout (world_size = tp_size x gtp_size):
+
+    rank = gtp_rank x tp_size + tp_rank
+
+    TP  group: all ranks that share the same gtp_rank  (size = tp_size)
+    GTP group: all ranks that share the same tp_rank   (size = gtp_size)
+
+Test groups
+-----------
+1.  TestTPGTPProcessGroups        - verify TP/GTP group sizes and rank assignment
+2.  TestTPGTPColumnParallelLinear - column-parallel Linear: fwd/bwd correctness (weight shape verified inline)
+3.  TestTPGTPRowParallelLinear    - row-parallel Linear: fwd/bwd smoke test + numerical correctness
+4.  TestTPGTPLayerNormLinear      - LayerNormLinear column-parallel smoke test
+
+Tests use (tp_size, gtp_size) = (2, 2) → world_size = 4 (runs on 4-GPU machines).
+
+Multi-GPU tests skip automatically when ``torch.distributed.get_world_size()`` does not match
+the requested combination of tp_size x gtp_size.
+"""
+
+import pytest
+import torch
+import torch.distributed as dist
+
+from megatron.core.tensor_parallel.gtp import HAVE_GTP
+
+if not HAVE_GTP:
+    pytest.skip("GTP requires TransformerEngine >= 2.17", allow_module_level=True)
+
+import transformer_engine.pytorch as te
+
+from megatron.core.tensor_parallel.gtp import GTPShardedParam
+from tests.unit_tests.generalized_tensor_parallel.gtp_test_utils import (
+    _make_gtp_linear,
+    _requires_multi_gpu,
+    _run_distributed,
+    _torchrun_dist_init,
+    reset_fp8_state,
+    reset_gtp_globals,
+)
+
+
+def _build_groups(rank: int, world_size: int, tp_size: int, gtp_size: int):
+    """Create TP and GTP process groups for a 2D parallelism grid.
+
+    Layout: rank = gtp_rank x tp_size + tp_rank
+      TP  group: contiguous block [gtp_rank*tp_size, (gtp_rank+1)*tp_size)
+      GTP group: strided set      {tp_rank, tp_rank+tp_size, tp_rank+2*tp_size, ...}
+
+    Every rank must call new_group for ALL groups (PyTorch distributed requirement).
+
+    Returns:
+        tp_group:  this rank's TP process group
+        gtp_group: this rank's GTP process group
+        tp_rank:   this rank's index within its TP group
+        gtp_rank:  this rank's index within its GTP group
+    """
+    assert tp_size * gtp_size == world_size
+    tp_rank = rank % tp_size
+    gtp_rank = rank // tp_size
+
+    tp_group = None
+    for er in range(gtp_size):
+        ranks = list(range(er * tp_size, (er + 1) * tp_size))
+        grp = dist.new_group(ranks)
+        if er == gtp_rank:
+            tp_group = grp
+
+    gtp_group = None
+    for tr in range(tp_size):
+        ranks = list(range(tr, world_size, tp_size))
+        grp = dist.new_group(ranks)
+        if tr == tp_rank:
+            gtp_group = grp
+
+    return tp_group, gtp_group, tp_rank, gtp_rank
+
+
+# ---------------------------------------------------------------------------
+# 1. TestTPGTPProcessGroups - group sizes and rank membership
+# ---------------------------------------------------------------------------
+
+
+def _worker_groups(rank, world_size, port, tp_size, gtp_size):
+    tp_group, gtp_group, tp_rank, gtp_rank = _build_groups(rank, world_size, tp_size, gtp_size)
+
+    assert tp_group.size() == tp_size, f"rank {rank}: TP group size {tp_group.size()} != {tp_size}"
+    assert (
+        gtp_group.size() == gtp_size
+    ), f"rank {rank}: GTP group size {gtp_group.size()} != {gtp_size}"
+    assert (
+        dist.get_rank(tp_group) == tp_rank
+    ), f"rank {rank}: TP rank {dist.get_rank(tp_group)} != expected {tp_rank}"
+    assert (
+        dist.get_rank(gtp_group) == gtp_rank
+    ), f"rank {rank}: GTP rank {dist.get_rank(gtp_group)} != expected {gtp_rank}"
+
+
+class TestTPGTPProcessGroups:
+    @pytest.mark.parametrize("tp_size,gtp_size", [(2, 2)])
+    def test_group_sizes_and_ranks(self, tp_size, gtp_size):
+        world_size = tp_size * gtp_size
+        _requires_multi_gpu(world_size)
+        _run_distributed(_worker_groups, world_size, tp_size, gtp_size)
+
+
+# ---------------------------------------------------------------------------
+# 2. TestTPGTPColumnParallelLinear
+# ---------------------------------------------------------------------------
+
+
+def _worker_column_correctness(rank, world_size, port, tp_size, gtp_size):
+    """Column-parallel output must equal inp @ (GTP-gathered TP-local weight)^T."""
+    torch.manual_seed(0)
+    tp_group, gtp_group, tp_rank, gtp_rank = _build_groups(rank, world_size, tp_size, gtp_size)
+
+    batch, in_f = 16, 64
+    out_f = tp_size * gtp_size * 32  # per-rank shard = 32 rows
+    dtype = torch.bfloat16
+
+    layer = _make_gtp_linear(
+        in_f, out_f, gtp_group, dtype, parallel_mode="column", tp_group=tp_group
+    )
+
+    # All-gather GTP shards → TP-local full weight [out_f/tp_size, in_f]
+    shard = layer.weight.data.clone()
+    all_gtp_shards = [torch.zeros_like(shard) for _ in range(gtp_size)]
+    dist.all_gather(all_gtp_shards, shard, group=gtp_group)
+    tp_local_weight = torch.cat(all_gtp_shards, dim=0).float()  # strip padding
+    tp_local_weight = tp_local_weight[: out_f // tp_size]
+
+    # Same full input on all ranks (column-parallel: each rank processes full input)
+    inp = torch.randn(batch, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(inp, src=0)
+    inp_te = inp.clone().requires_grad_(True)
+
+    # TE forward: GTP all-gathers weight internally; no TP comm in column-parallel fwd
+    out = layer(inp_te, is_first_microbatch=True)
+    assert out.shape == (
+        batch,
+        out_f // tp_size,
+    ), f"rank {rank}: output shape {out.shape} != ({batch}, {out_f // tp_size})"
+
+    # Reference: this TP rank's output = inp @ tp_local_weight^T
+    ref = inp.float() @ tp_local_weight.T
+    ref = ref.to(dtype)
+    assert torch.allclose(
+        out.float(), ref.float(), atol=1e-2, rtol=1e-2
+    ), f"rank {rank}: output mismatch, max_diff={(out.float() - ref.float()).abs().max():.4f}"
+
+    # Backward: dX is all-reduced across TP group internally by TE
+    grad = torch.randn_like(out)
+    dist.broadcast(grad, src=0)
+    # wgrad RS path always accumulates into main_grad; allocate before backward.
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+    out.backward(grad)
+    assert inp_te.grad is not None and inp_te.grad.shape == inp.shape
+    assert torch.isfinite(inp_te.grad).all(), f"rank {rank}: non-finite dX"
+
+
+class TestTPGTPColumnParallelLinear:
+    @pytest.mark.parametrize("tp_size,gtp_size", [(2, 2)])
+    def test_forward_backward_correctness(self, tp_size, gtp_size):
+        world_size = tp_size * gtp_size
+        _requires_multi_gpu(world_size)
+        _run_distributed(_worker_column_correctness, world_size, tp_size, gtp_size)
+
+
+# ---------------------------------------------------------------------------
+# 3. TestTPGTPRowParallelLinear
+# ---------------------------------------------------------------------------
+
+
+def _worker_row_forward_backward(rank, world_size, port, tp_size, gtp_size):
+    """Row-parallel: weight shape verified; output is all-reduced [batch, out_f]; backward produces finite dX."""
+    torch.manual_seed(0)
+    tp_group, gtp_group, tp_rank, _ = _build_groups(rank, world_size, tp_size, gtp_size)
+
+    batch = 16
+    in_f = tp_size * 64  # full in_features
+    out_f = gtp_size * 64  # full out_features
+    dtype = torch.bfloat16
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype, parallel_mode="row", tp_group=tp_group)
+
+    expected_shape = (out_f // gtp_size, in_f // tp_size)
+    assert isinstance(
+        layer.weight, GTPShardedParam
+    ), f"rank {rank}: weight should be GTPShardedParam"
+    assert (
+        layer.weight.shape == expected_shape
+    ), f"rank {rank}: expected {expected_shape}, got {layer.weight.shape}"
+
+    # Row-parallel: each TP rank takes the corresponding slice of in_f
+    full_inp = torch.randn(batch, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(full_inp, src=0)
+    local_in_f = in_f // tp_size
+    inp = full_inp[:, tp_rank * local_in_f : (tp_rank + 1) * local_in_f]
+    inp = inp.clone().requires_grad_(True)
+
+    # TE forward: GTP all-gathers weight, row-parallel all-reduces output across TP
+    out = layer(inp, is_first_microbatch=True)
+    assert out.shape == (
+        batch,
+        out_f,
+    ), f"rank {rank}: output shape {out.shape} != ({batch}, {out_f})"
+    assert torch.isfinite(out).all(), f"rank {rank}: non-finite output"
+
+    # wgrad RS path always accumulates into main_grad; allocate before backward.
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+    out.sum().backward()
+    assert inp.grad is not None and inp.grad.shape == inp.shape
+    assert torch.isfinite(inp.grad).all(), f"rank {rank}: non-finite dX"
+
+
+def _worker_row_correctness(rank, world_size, port, tp_size, gtp_size):
+    """Row-parallel all-reduced output must equal inp_full @ full_weight^T."""
+    torch.manual_seed(0)
+    tp_group, gtp_group, tp_rank, _ = _build_groups(rank, world_size, tp_size, gtp_size)
+
+    batch = 16
+    in_f = tp_size * 64
+    out_f = gtp_size * 64
+    dtype = torch.bfloat16
+
+    layer = _make_gtp_linear(in_f, out_f, gtp_group, dtype, parallel_mode="row", tp_group=tp_group)
+
+    # Reconstruct full weight: all-gather GTP shards → TP-local, then all-gather TP shards
+    shard = layer.weight.data.clone()
+    all_gtp_shards = [torch.zeros_like(shard) for _ in range(gtp_size)]
+    dist.all_gather(all_gtp_shards, shard, group=gtp_group)
+    tp_local_weight = torch.cat(all_gtp_shards, dim=0).float()  # [out_f, in_f/tp_size]
+
+    all_tp_weights = [torch.zeros_like(tp_local_weight) for _ in range(tp_size)]
+    dist.all_gather(all_tp_weights, tp_local_weight, group=tp_group)
+    full_weight = torch.cat(all_tp_weights, dim=1).float()  # [out_f, in_f]
+
+    # Full input (same on all ranks; we slice below to simulate row-parallel)
+    full_inp = torch.randn(batch, in_f, dtype=dtype, device="cuda")
+    dist.broadcast(full_inp, src=0)
+    local_in_f = in_f // tp_size
+    inp = full_inp[:, tp_rank * local_in_f : (tp_rank + 1) * local_in_f].clone()
+    inp.requires_grad_(True)
+
+    out = layer(inp, is_first_microbatch=True)
+
+    # Reference: full input @ full weight^T — all ranks should see the same output
+    ref = full_inp.float() @ full_weight.T
+    ref = ref.to(dtype)
+    assert torch.allclose(
+        out.float(), ref.float(), atol=2e-2, rtol=1e-2
+    ), f"rank {rank}: output mismatch, max_diff={(out.float() - ref.float()).abs().max():.4f}"
+
+
+class TestTPGTPRowParallelLinear:
+    @pytest.mark.parametrize("tp_size,gtp_size", [(2, 2)])
+    def test_forward_backward(self, tp_size, gtp_size):
+        world_size = tp_size * gtp_size
+        _requires_multi_gpu(world_size)
+        _run_distributed(_worker_row_forward_backward, world_size, tp_size, gtp_size)
+
+    @pytest.mark.parametrize("tp_size,gtp_size", [(2, 2)])
+    def test_forward_correctness(self, tp_size, gtp_size):
+        world_size = tp_size * gtp_size
+        _requires_multi_gpu(world_size)
+        _run_distributed(_worker_row_correctness, world_size, tp_size, gtp_size)
+
+
+# ---------------------------------------------------------------------------
+# 4. TestTPGTPLayerNormLinear - column-parallel smoke test
+# ---------------------------------------------------------------------------
+
+
+def _worker_layernorm_linear(rank, world_size, port, tp_size, gtp_size):
+    torch.manual_seed(0)
+    tp_group, gtp_group, _, _ = _build_groups(rank, world_size, tp_size, gtp_size)
+
+    seq, batch = 4, 2
+    in_f = 64
+    out_f = tp_size * gtp_size * 32
+    dtype = torch.bfloat16
+
+    layer = te.LayerNormLinear(
+        in_features=in_f,
+        out_features=out_f,
+        bias=False,
+        params_dtype=dtype,
+        parallel_mode="column",
+        device="cuda",
+        tp_group=tp_group,
+        gtp_group=gtp_group,
+    )
+    assert isinstance(
+        layer.weight, GTPShardedParam
+    ), f"rank {rank}: LayerNormLinear.weight should be GTPShardedParam"
+    expected_rows = out_f // (tp_size * gtp_size)
+    assert layer.weight.shape == (
+        expected_rows,
+        in_f,
+    ), f"rank {rank}: unexpected weight shape {layer.weight.shape}"
+
+    inp = torch.randn(seq, batch, in_f, dtype=dtype, device="cuda", requires_grad=True)
+    dist.broadcast(inp, src=0)
+
+    out = layer(inp, is_first_microbatch=True)
+    assert out.shape == (seq, batch, out_f // tp_size), f"rank {rank}: output shape {out.shape}"
+    assert torch.isfinite(out).all(), f"rank {rank}: non-finite output"
+
+    # wgrad RS path always accumulates into main_grad; allocate before backward.
+    layer.weight.main_grad = torch.zeros(layer.weight.shape, dtype=dtype, device="cuda")
+    out.sum().backward()
+    assert inp.grad is not None and inp.grad.shape == inp.shape
+    assert torch.isfinite(inp.grad).all(), f"rank {rank}: non-finite dX"
+
+
+class TestTPGTPLayerNormLinear:
+    @pytest.mark.parametrize("tp_size,gtp_size", [(2, 2)])
+    def test_forward_backward(self, tp_size, gtp_size):
+        world_size = tp_size * gtp_size
+        _requires_multi_gpu(world_size)
+        _run_distributed(_worker_layernorm_linear, world_size, tp_size, gtp_size)
diff --git a/tests/unit_tests/models/test_hybrid_moe_model.py b/tests/unit_tests/models/test_hybrid_moe_model.py
index ec0e79d77ef..02513142c8a 100644
--- a/tests/unit_tests/models/test_hybrid_moe_model.py
+++ b/tests/unit_tests/models/test_hybrid_moe_model.py
@@ -95,6 +95,8 @@
     "ep_overlap_early_attn_memory_release": False,
     "experimental_attention_variant": None,
     "expert_model_parallel_size": 4,
+    "expert_gtp_weight_remat_size": 1,
+    "expert_tensor_parallel_num_weight_shards": 1,
     "expert_tensor_parallel_size": 1,
     "external_cuda_graph": False,
     "ffn_hidden_size": 1856,
@@ -122,6 +124,7 @@
     "fused_residual_rmsnorm": False,
     "fused_single_qkv_rope": False,
     "gated_linear_unit": False,
+    "gtp_weight_remat_size": 1,
     "glu_linear_offset": 0.0,
     "grad_scale_func": None,
     "mtp_grad_scale_func": None,
@@ -266,6 +269,7 @@
     "softmax_type": "vanilla",
     "symmetric_ar_type": None,
     "tensor_model_parallel_size": 2,
+    "tensor_parallel_num_weight_shards": 2,
     "test_mode": False,
     "timers": None,
     "tp_comm_atomic_ag": False,
diff --git a/tests/unit_tests/test_process_groups_config.py b/tests/unit_tests/test_process_groups_config.py
index b49962b1a5a..a61936bd132 100644
--- a/tests/unit_tests/test_process_groups_config.py
+++ b/tests/unit_tests/test_process_groups_config.py
@@ -29,7 +29,7 @@ def test_transformer_process_groups(self, mocker):
         # Test attribute existence
         assert hasattr(model_pgs, 'tp')
         assert hasattr(model_pgs, 'pp')
-        assert not hasattr(model_pgs, 'cp')  # Not set yet
+        assert model_pgs.cp is None  # Not set yet
 
     def test_grad_comm_process_groups(self, mocker):
         """Test basic functionality of ProcessGroupCollection."""
@@ -47,7 +47,7 @@ def test_grad_comm_process_groups(self, mocker):
 
         # Test attribute existence
         assert hasattr(grad_pgs, 'dp')
-        assert not hasattr(grad_pgs, 'dp_cp')  # Not set yet
+        assert grad_pgs.dp_cp is None  # Not set yet
 
     def test_hierarchical_context_parallel_groups(self, mocker):
         """Test setting and accessing the hierarchical context parallel list."""
@@ -129,7 +129,7 @@ def test_default_initialization(self):
         assert hasattr(model_pgs, 'tp')
         assert hasattr(model_pgs, 'pp')
         assert hasattr(model_pgs, 'cp')
-        assert not hasattr(model_pgs, 'dp')
+        assert model_pgs.dp is None  # Not requested, so not set
 
         # Test that an error is raised if an invalid process group is requested
         with pytest.raises(ValueError, match=r"Invalid process groups requested"):