fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers by yuchenwang3 · Pull Request #5395 · NVIDIA/Megatron-LM

yuchenwang3 · 2026-06-17T20:47:11Z

Background. Encountered while running ms-swift + Megatron-Core SFT of Qwen3.5-35B-A3B (GatedDeltaNet hybrid, optimizer=dist_muon) on 16× B200 across 2 nodes (2×8). Full root-cause write-up in #5394.

Scope & root cause (optimizer-agnostic)

The underlying failure is not Muon-specific. A benign-but-large global grad_norm (here ~5e7 from the GatedDeltaNet layers, growing to ~2e11) makes the clip coefficient c = clip_grad / grad_norm ≈ 2e-8, which scales per-matrix gradients below the optimizer's numerical floor and freezes updates. The same condition stalls AdamW too (its eps floor) — i.e. the clip, not the optimizer, is the root cause. This is now confirmed directly (see Testing): AdamW with clip_grad=1.0 stalls at ~0.5, while AdamW with clip_grad=0 overfits a fixed batch to 0.029 (vs Muon's 0.019). (We used dist_muon here following the model's training recipe; it is not required — it is simply naturally immune to this clip, since magnitude clipping is a no-op for an orthogonalizing update.)

This PR fixes the case that has a clean, side-effect-free fix: for an orthogonalizing optimizer, magnitude clipping is semantically meaningless — Newton–Schulz discards gradient magnitude — so skipping it is unambiguously correct. There is no equally clean "skip clip" for a magnitude-based optimizer (it legitimately uses clipping for spike protection); the optimizer-agnostic alternative is to clamp the clip coefficient so a benign-large grad_norm can't drive gradients under the eps floor (listed below).

The harmful path, concretely

ChainedOptimizer.step clips every non-stub sub-optimizer's main_params with the global grad_norm whenever optimizer.config.clip_grad > 0. Applied to a Muon sub-optimizer, the tiny c ≈ 2e-8 pushes its per-matrix gradients below Newton–Schulz's internal F.normalize floor, so the orthogonalization silently degenerates (near-zero, non-orthogonal update) and training stalls — with the forward/loss looking completely normal.

Why not a config-level fix

Setting clip_grad=0 only for the Muon sub-optimizer is not possible via config, because ChainedOptimizer requires all sub-optimizers to share the same OptimizerConfig (the DistOpt+LayerWise path even resets result.config = config to satisfy this). So this PR uses a per-optimizer attribute instead.

Change

ChainedOptimizer.step: skip the norm-clip for a sub-optimizer carrying skip_grad_norm_clip = True (defaults to False via getattr, so all existing optimizers are unaffected). grad_norm is still computed and returned for logging; only the clip is skipped, and only for the flagged (Muon) sub-optimizer — Adam sub-optimizers still clip normally.
get_megatron_optimizer / _build_emerging_optimizer: set skip_grad_norm_clip = True on the emerging (Muon) optimizer — both the standard-chain wrapper and the LayerWiseDistributedOptimizer (which owns the Muon-managed matrix params for dist_muon).

Testing

The runtime effect — not magnitude-clipping the Muon param group — was validated on the actual Qwen3.5-35B-A3B (GatedDeltaNet) model on 16× B200 across 2 nodes (2×8), optimizer=dist_muon:

Muon group's gradient clipping off (equivalent to this PR; in our run via clip_grad=0): a fixed 32-example batch overfits to ~0.02, real-data training learns.
Global clip on (clip_grad=1.0, current behavior): training stalls (loss flat ~0.5, grad_norm 7.5e7 → 2.2e11).
Dual single-variable control: keeping clip_grad=1.0 but only lowering Newton–Schulz eps 1e-7→1e-30 also unblocks learning, confirming the clip → NS-eps interaction.
Optimizer-agnostic confirmation (optimizer=adam, fixed 64-example overfit, 200 steps): clip_grad=0 → loss 0.69 → 0.029 (vs Muon clip_grad=0 → 0.019), while clip_grad=1.0 stalls at ~0.5. grad_norm stayed 1e8–3e12 throughout with no NaN — Adam's per-parameter second-moment normalization digests the raw gradients fine once they are not crushed by the clip. So the clip is the sole cause across optimizers; Adam is viable without it, and Muon is not required for correctness (just lower/steadier).

This PR reproduces that validated behavior via the skip_grad_norm_clip attribute instead of a global clip_grad=0; the exact attribute wiring should additionally be exercised by CI.

Alternatives

A config/CLI flag (e.g. --no-clip-orthogonalizing-optimizers) instead of an implicit attribute.
Detect orthogonalizing optimizers by type rather than an attribute.
A more general fix that also helps magnitude-based optimizers: clamp the clip coefficient with a floor so a benign-but-large grad_norm can't drive per-matrix gradients under the optimizer's numerical eps. (Broader blast radius — affects all training — hence not chosen here.)
Pair with the related robustness fix so the orthogonalizer doesn't silently degenerate on small inputs: newton_schulz silently degenerates (non-orthogonal output) for small-norm inputs due to F.normalize eps floor NVIDIA-NeMo/Emerging-Optimizers#229 / PR fix(muon_utils): keep newton_schulz scale-invariant for small-norm inputs (#229) NVIDIA-NeMo/Emerging-Optimizers#230.

Refs #5394

…timizers ChainedOptimizer.step magnitude-clips every sub-optimizer with config.clip_grad>0, including Muon. Muon's update is scale-invariant (Newton-Schulz discards gradient magnitude), so the clip is a no-op at best; at worst, a large global grad_norm makes the clip coefficient tiny and pushes per-matrix gradients below Newton-Schulz's normalization floor, silently degenerating the orthogonalization and stalling training. A config-level clip_grad=0 for the Muon group is not possible because ChainedOptimizer requires all sub-optimizers to share one OptimizerConfig. Use a per-optimizer 'skip_grad_norm_clip' attribute (default False via getattr) instead, set on the emerging (Muon) optimizer and the LayerWiseDistributedOptimizer that owns its params. Refs NVIDIA#5394 Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

copy-pr-bot · 2026-06-17T20:47:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ShauryaaSharma

Request Changes — the fix is directionally correct but has two blocking issues.

Bug 1 (Critical): The skip_grad_norm_clip flag is set on the layer_wise_optimizer
container, but when results is empty the training loop calls .step() directly on it.
The inherited ChainedOptimizer.step() then iterates the inner per-layer sub-optimizers —
none of which carry the flag — so clipping still fires. The fix is entirely bypassed for
this code path.

Bug 2 (High): The flag is applied unconditionally to every optimizer routed through
_get_megatron_emerging_optimizer, including SOAP and Lion. These are not scale-invariant
and should remain subject to clipping.

Two medium/low findings are also included as inline comments (grad_norm_skip_threshold
pollution and the should_clip wasted AllReduce).

@ShauryaaSharma

…ip-threshold norm Addresses @ShauryaaSharma's review on NVIDIA#5395: - Critical: set skip_grad_norm_clip on the raw orthogonalizing sub-optimizers and re-propagate onto LayerWiseDistributedOptimizer.chained_optimizers after construction (LayerWise re-wraps base opts in Float16OptimizerWithFloat16Params under bf16; wrappers don't forward attribute access, so the direct path needs the flag on the actual members). - High: gate the flag on isinstance(opt, OrthogonalizedOptimizer) so SOAP/Lion still clip; container flag now set only when every base sub is orthogonalizing. - Medium: _get_grad_norm_skip_threshold() excludes skip-flagged subs' grads from the norm used for grad_norm_skip_threshold (Muon's huge grad no longer trips skip for Adam subs); identical to get_grad_norm() when nothing is flagged. Scoped to the threshold check. - Low: should_clip predicate now also checks skip_grad_norm_clip (no wasted AllReduce). Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

…_skip_threshold The review-revision's _get_grad_norm_skip_threshold() unconditionally called get_grad_stats_parallel_group(), which asserts all chained sub-optimizers share one grad-stats group. On the distributed-optimizer path (Muon LayerWise sub + Adam DistributedOptimizer sub) the groups differ, so it raised AssertionError and broke every use_param_layout=True layer-wise test (regression vs base). Now compute sharedness over the NON-SKIP subs only (sidestepping the container assertion), return early for the all-skip (Muon-only -> 0.0) and single-non-skip cases, and fall back to combining per-sub norms when groups are non-shared -- mirroring get_grad_norm()'s own shared/non-shared handling. Found by on-node 8xH100 regression testing. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

…topt path) Covers the four review fixes and the distributed-optimizer regression: - B1: skip flag on the bf16-wrapper members of LayerWiseDistributedOptimizer .chained_optimizers (Muon-only container flagged; Adam sub in a mix unflagged). - B2: isinstance(OrthogonalizedOptimizer) gate (muon flagged, lion not). - B3: _get_grad_norm_skip_threshold() excludes flagged subs' grads; Adam update not wrongly skipped by a Muon sub's huge grad. - B4: should_clip short-circuits a Muon-only chain. - distopt path: get_megatron_optimizer with Muon LayerWise + Adam DistributedOptimizer steps without the shared-grad-stats-group assertion. Verified on an 8-GPU (sm90) node, torchrun nproc=2: 8 passed. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

yuchenwang3 · 2026-06-18T21:58:19Z

Thanks for the thorough review, @ShauryaaSharma — all four findings are addressed, and I verified the result on an 8-GPU (sm90/H100-class) node.

1 (Critical). The flag is now set on the raw orthogonalizing sub before it's appended to the LayerWise base list, and re-propagated onto layer_wise_optimizer.chained_optimizers after construction. The second step is required: LayerWiseDistributedOptimizer re-wraps each base optimizer in Float16OptimizerWithFloat16Params under bf16, and the wrappers don't forward attribute access, so the inherited ChainedOptimizer.step() (the results-empty direct path) reads the flag off the wrapper. Verified: a Muon-only bf16 build returns a bare LayerWiseDistributedOptimizer whose every chained member carries the flag; a Muon+Adam build leaves the Adam sub unflagged.

2 (High). Gated on isinstance(optimizer, OrthogonalizedOptimizer), computed on the raw optimizer. On emerging_optimizers 0.4.0a0 this hits exactly the Muon family (Muon, AdaptiveMuon, Scion, Spel, MOP, PolarGrad, MuonHyperball, SinkhornMuon); SOAP and Lion are not subclasses and keep clipping. Container flag set only when all base subs are orthogonalizing. Added a guarded OrthogonalizedOptimizer import with an empty-tuple sentinel so isinstance is safely False if the package is absent.

3 (Medium). Added _get_grad_norm_skip_threshold(), which excludes skip-flagged subs' grads from the norm used for the grad_norm_skip_threshold comparison (and exempts skip-flagged subs from the check). It returns get_grad_norm() unchanged when nothing is flagged, so no behavior change for non-Muon users. Scoped to the threshold check (the clip already short-circuits for skip-flagged subs). Verified: a Muon sub at grad_norm ≈ 6.2e6 + a well-behaved Adam sub → get_grad_norm()=6.2e6 vs _get_grad_norm_skip_threshold()=4.0e-4, so the Adam update is no longer wrongly skipped.

4 (Low). should_clip now also checks skip_grad_norm_clip, so a Muon-only chain no longer triggers _compute_grad_norms_by_group() (verified: 0 calls for a Muon-only chain, ≥1 for an Adam chain).

One issue I caught and fixed while testing (flagging for transparency). My first cut of _get_grad_norm_skip_threshold() called get_grad_stats_parallel_group() unconditionally, which asserts all chained subs share one grad-stats group. On the distributed-optimizer path (Muon LayerWiseDistributedOptimizer + Adam DistributedOptimizer) the groups differ, so it raised AssertionError and broke the use_param_layout=True layer-wise tests. The latest commit restores get_grad_norm()'s shared/non-shared handling — sharedness is judged over the non-skip subs only (sidestepping the container assertion), with a per-sub-norm fallback when non-shared and an early return for the Muon-only case. Those [True] tests are green again.

Verification (8-GPU node, torchrun, vs base ad5a93b). No new failures vs base: test_layer_wise_optimizer.py is back to 41/0 (the 15 [True] failures cleared), optimizer/ and test_lion_optimizer.py unchanged. (There are 27 pre-existing adaptive_muon/soap failures from an emerging_optimizers 0.4.0a0 API skew — identical on base and branch, unrelated to this PR.) I also added tests/unit_tests/optimizer/test_skip_grad_norm_clip.py covering the four behaviors above plus the distributed-optimizer path (8/8 pass).

The clip total_norm for non-skip subs still uses the full grad_norm — I scoped finding 3 to the threshold as you suggested; happy to narrow the clip norm too in a follow-up if you'd prefer. Ready for another look.

yuchenwang3 · 2026-06-23T21:43:51Z

Thanks again for the review and approval, @ShauryaaSharma 🙏 This still shows REVIEW_REQUIRED and the copy-pr-bot CI hasn't been triggered yet, so it can't merge as-is. Anything needed from my side — happy to rebase onto latest main? If it's just the CI / CODEOWNER step, could you /ok to test or loop in someone from the optimizer CODEOWNERS to help land it? Cheers.

github-actions Bot added the community-request label Jun 17, 2026

yuchenwang3 marked this pull request as ready for review June 17, 2026 20:52

yuchenwang3 requested review from a team as code owners June 17, 2026 20:52

svcnvidia-nemo-ci requested a review from a team June 17, 2026 20:53

This was referenced Jun 17, 2026

[BUG] ChainedOptimizer applies global grad-norm clipping to Muon (orthogonalizing) param groups, silently stalling training #5394

Open

fix(muon_utils): keep newton_schulz scale-invariant for small-norm inputs (#229) NVIDIA-NeMo/Emerging-Optimizers#230

Merged

ShauryaaSharma suggested changes Jun 18, 2026

View reviewed changes

Comment thread megatron/core/optimizer/__init__.py Outdated

Comment thread megatron/core/optimizer/__init__.py Outdated

Comment thread megatron/core/optimizer/optimizer.py Outdated

Comment thread megatron/core/optimizer/optimizer.py

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 18, 2026

yuchenwang3 added 2 commits June 18, 2026 14:24

yuchenwang3 requested a review from ShauryaaSharma June 18, 2026 22:28

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 19, 2026

ShauryaaSharma approved these changes Jun 20, 2026

View reviewed changes

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers#5395

fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers#5395
yuchenwang3 wants to merge 4 commits into
NVIDIA:mainfrom
yuchenwang3:fix/skip-grad-clip-for-muon

yuchenwang3 commented Jun 17, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

ShauryaaSharma left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yuchenwang3 commented Jun 18, 2026

Uh oh!

yuchenwang3 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

yuchenwang3 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Scope & root cause (optimizer-agnostic)

The harmful path, concretely

Why not a config-level fix

Change

Testing

Alternatives

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

ShauryaaSharma left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yuchenwang3 commented Jun 18, 2026

Uh oh!

yuchenwang3 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yuchenwang3 commented Jun 17, 2026 •

edited

Loading