[feat] Generalized Tensor Parallelism (GTP) #4967

Open
fanshiqing wants to merge 64 commits into NVIDIA:main from fanshiqing:gtp_release
@fanshiqing fanshiqing commented May 25, 2026

Member
  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

Generalized Tensor Parallelism (GTP) is a lightweight, high-performance, memory-efficient distributed training strategy implemented in Megatron-LM and TransformerEngine. It shards weight tensors across a GTP process group and reconstructs them on demand via asynchronous all-gather. Because that communication overlaps with computation, larger models can be trained without sacrificing throughput.
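
As a rough illustration of the data movement described above, the following single-process sketch simulates sharding a weight along its first dimension, rebuilding it with an all-gather, and returning gradient shards with a reduce-scatter. It is not the PR's implementation: the helper names (`shard_rows`, `all_gather_rows`, `reduce_scatter_rows`) are hypothetical, plain Python lists stand in for tensors, and the asynchronous overlap with computation is not modeled.

```python
from typing import List

Matrix = List[List[float]]


def shard_rows(weight: Matrix, world_size: int) -> List[Matrix]:
    """Split a weight matrix into equal row shards, one per (simulated) rank."""
    rows = len(weight)
    assert rows % world_size == 0, "rows must divide evenly across ranks"
    per_rank = rows // world_size
    return [weight[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def all_gather_rows(shards: List[Matrix]) -> Matrix:
    """Rebuild the full weight by concatenating shards in rank order."""
    return [row for shard in shards for row in shard]


def reduce_scatter_rows(full_grads: List[Matrix], world_size: int) -> List[Matrix]:
    """Sum every rank's full-size gradient, then give each rank its row shard."""
    rows, cols = len(full_grads[0]), len(full_grads[0][0])
    summed = [[sum(g[i][j] for g in full_grads) for j in range(cols)] for i in range(rows)]
    return shard_rows(summed, world_size)


weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shards = shard_rows(weight, world_size=2)        # each rank stores half the rows
rebuilt = all_gather_rows(shards)                # full weight, needed only for compute
grad_shards = reduce_scatter_rows([weight, weight], world_size=2)
```

Each rank keeps only its shard between uses, which is where the memory saving comes from; the full weight exists only for the duration of the layer's computation.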

GTP's Architecture (Mcore + TE)

① Mcore registers callbacks into TE at import time.
② TE calls back into Mcore runtime during te.Linear(gtp_group=…) init AND during fwd/bwd (weight.all_gather_and_prefetch / wgrad_reduce_scatter).
③ Mcore extensions forward gtp_group= at module init.
④ TE provides MXFP8 / NVFP4 tensor types AND the quantize-then-AG / RS collectives (gather_along_first_dim, reduce_scatter_along_first_dim) — imported by Mcore runtime; GTP wraps them with its own schedule, buffer cache, and stream choreography.
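
The callback direction in steps ① to ③ can be sketched with a toy registry. Everything here is a stand-in under stated assumptions: `register_callback`, the `_CALLBACKS` table, and the toy `Linear` class are hypothetical and do not mirror the real TE or Mcore signatures; the "group" is simply a list of shards.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the TE side: a registry that knows nothing about Mcore.
_CALLBACKS: Dict[str, Callable] = {}


def register_callback(name: str, fn: Callable) -> None:
    _CALLBACKS[name] = fn


class Linear:
    """Toy layer; it calls the registered hook only when gtp_group is given."""

    def __init__(self, weight_shard, gtp_group=None):
        self.weight_shard = weight_shard
        self.gtp_group = gtp_group

    def forward(self, x):
        weight = self.weight_shard
        if self.gtp_group is not None:
            # Call back into the runtime that was registered by the other library.
            weight = _CALLBACKS["all_gather"](self.weight_shard, self.gtp_group)
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Hypothetical stand-in for the Mcore side: registers its runtime at import time.
def _all_gather(shard, group):
    # `group` is modeled as the list of all shards in rank order.
    return [row for s in group for row in s]


register_callback("all_gather", _all_gather)

shards = [[[1.0, 0.0]], [[0.0, 1.0]]]
layer = Linear(shards[0], gtp_group=shards)
out = layer.forward([3.0, 4.0])
```

The point of the indirection is that the layer library never imports the scheduling runtime; without `gtp_group` the layer simply uses its local weight.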

image

Changes Summary

image

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

Co-authored-by: Jieming Zhang <jiemingz@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing fanshiqing requested review from a team as code owners May 25, 2026 07:25
@copy-pr-bot

copy-pr-bot Bot commented May 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 25, 2026 07:25
@github-actions

Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing fanshiqing marked this pull request as ready for review May 25, 2026 08:11
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 25, 2026 08:12
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing fanshiqing changed the title Generalized Tensor Parallelism (GTP) [feat] Generalized Tensor Parallelism (GTP) May 25, 2026
@fanshiqing

Member Author

/claude review

Comment thread megatron/core/extensions/transformer_engine.py Outdated
Comment thread megatron/core/distributed/finalize_model_grads.py Outdated
Comment thread megatron/core/distributed/finalize_model_grads.py Outdated
Comment thread megatron/core/distributed/finalize_model_grads.py Outdated
- DDP: route the backward post-hook for GTP params through
  register_grad_accum_hook and skip the autograd AccumulateGrad hook, so
  grad-ready fires only after the GTP wgrad add.
- GTP finalize path (_wait_reduce_scatter, finalize_grad=True): fire
  _handle_megatron_grad_accum after the add so terminal/async-only weights are
  not orphaned once the autograd path is suppressed.
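
The ordering constraint in this commit message can be illustrated with a toy model. This is only a sketch under assumptions: `ToyParam`, its hook list, and the `backward` function are hypothetical, and the real DDP and GTP code paths are not shown in this thread.

```python
from typing import Callable, List


class ToyParam:
    """Hypothetical parameter holding a main_grad shard and grad-accum hooks."""

    def __init__(self, is_gtp: bool):
        self.is_gtp = is_gtp
        self.main_grad = 0.0
        self._grad_accum_hooks: List[Callable[["ToyParam"], None]] = []

    def register_grad_accum_hook(self, hook: Callable[["ToyParam"], None]) -> None:
        self._grad_accum_hooks.append(hook)

    def fire_grad_accum_hooks(self) -> None:
        for hook in self._grad_accum_hooks:
            hook(self)


events: List[str] = []


def grad_ready(param: ToyParam) -> None:
    events.append(f"grad_ready(main_grad={param.main_grad})")


def backward(param: ToyParam, local_wgrad: float, reduced_wgrad_shard: float) -> None:
    if param.is_gtp:
        # The autograd-style hook is skipped: the reduce-scattered shard is
        # added first, and only then is grad-ready signalled.
        param.main_grad += reduced_wgrad_shard
        events.append("gtp_wgrad_add")
        param.fire_grad_accum_hooks()
    else:
        param.main_grad += local_wgrad
        grad_ready(param)


p = ToyParam(is_gtp=True)
p.register_grad_accum_hook(grad_ready)
backward(p, local_wgrad=1.0, reduced_wgrad_shard=4.0)
```

If grad-ready fired from the autograd hook instead, it would observe `main_grad` before the GTP add had landed.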

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing

Member Author

/ok to test c33667a

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing

Member Author

/claude review

Comment thread megatron/core/process_groups_config.py

@claude claude Bot left a comment

Contributor

Large, well-documented feature PR with comprehensive test coverage (9 new test files). The architecture is sound — GTP/EGTP are correctly integrated as orthogonal parallelism axes with proper fallbacks when inactive.

One bug flagged inline: the new __getattr__ on ProcessGroupCollection breaks the hasattr-based fallback checks for the new GTP-related groups when a custom ProcessGroupCollection is constructed without those fields (backward-compat path). The fix is straightforward — use name in vars(pg_collection) instead of hasattr for the optional-group fallbacks, consistent with how the required-group checks were already updated.
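
A minimal reproduction of this class of bug, assuming a `__getattr__` that returns a value instead of raising `AttributeError` (the PR's actual `__getattr__` body is not shown in this thread, and `ToyGroups` is a hypothetical stand-in for `ProcessGroupCollection`):

```python
class ToyGroups:
    """Hypothetical collection whose __getattr__ never raises AttributeError."""

    def __init__(self, **groups):
        for name, group in groups.items():
            setattr(self, name, group)

    def __getattr__(self, name):
        # Called only when normal lookup fails; returning a value hides the miss.
        return None


pg = ToyGroups(tp="tp-group")

seen_by_hasattr = hasattr(pg, "gtp")  # lookup falls through to __getattr__
seen_by_vars = "gtp" in vars(pg)      # checks only attributes actually set
```

Here `hasattr` reports the group as present even though it was never set, so a fallback guarded by `if not hasattr(...)` would never run, while the `vars`-based membership test still detects the missing field.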

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
…an-readable dtype in GTP weight cache

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing

This comment was marked as outdated.

@fanshiqing

This comment was marked as outdated.

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Make megatron.core self-contained: it must not import from
megatron.experimental, which is not shipped with the core wheel.

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
…unner.stream fence

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
…ction

fix2: make GTP module import gracefully without TransformerEngine
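
One common way to make a module import cleanly when an optional dependency is missing is to guard the import and defer the failure to use time. The sketch below shows that general pattern, not the code in this commit; `optional_import` and `require_te` are hypothetical helper names.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it can be imported, otherwise None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


te = optional_import("transformer_engine")
HAVE_TE = te is not None


def require_te(feature: str) -> None:
    """Raise a clear error when the feature is used, not when the module loads."""
    if not HAVE_TE:
        raise RuntimeError(f"{feature} requires TransformerEngine, which is not installed")
```

Callers that actually need the dependency invoke `require_te("GTP")` first; everything else can import the module regardless of the environment.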

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing

fanshiqing commented Jun 26, 2026

Member Author

@ericharper @jaredcasper Hey Eric and Jared, could you help take a look at this MR? Let me know if you have any concerns about it so that I can make changes accordingly.

Any comments are welcome!

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@fanshiqing

Member Author

/ok to test 00c9d20

@@ -1191,6 +1282,7 @@ def __init__(
tp_comm_buffer_name: str | None = None, # Not used
tp_group: Optional[torch.distributed.ProcessGroup] = None,

Contributor

make tp_group a list of [tp_group, gtp_remat_group].

6 participants