Skip to content

fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers#5395

Open
yuchenwang3 wants to merge 4 commits into
NVIDIA:mainfrom
yuchenwang3:fix/skip-grad-clip-for-muon
Open

fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers#5395
yuchenwang3 wants to merge 4 commits into
NVIDIA:mainfrom
yuchenwang3:fix/skip-grad-clip-for-muon

Conversation

@yuchenwang3

@yuchenwang3 yuchenwang3 commented Jun 17, 2026

Copy link
Copy Markdown

Background. Encountered while running ms-swift + Megatron-Core SFT of Qwen3.5-35B-A3B (GatedDeltaNet hybrid, optimizer=dist_muon) on 16× B200 across 2 nodes (2×8). Full root-cause write-up in #5394.

Scope & root cause (optimizer-agnostic)

The underlying failure is not Muon-specific. A benign-but-large global grad_norm (here ~5e7 from the GatedDeltaNet layers, growing to ~2e11) makes the clip coefficient c = clip_grad / grad_norm ≈ 2e-8, which scales per-matrix gradients below the optimizer's numerical floor and freezes updates. The same condition stalls AdamW too (its eps floor) — i.e. the clip, not the optimizer, is the root cause. This is now confirmed directly (see Testing): AdamW with clip_grad=1.0 stalls at ~0.5, while AdamW with clip_grad=0 overfits a fixed batch to 0.029 (vs Muon's 0.019). (We used dist_muon here following the model's training recipe; it is not required — it is simply naturally immune to this clip, since magnitude clipping is a no-op for an orthogonalizing update.)

This PR fixes the case that has a clean, side-effect-free fix: for an orthogonalizing optimizer, magnitude clipping is semantically meaningless — Newton–Schulz discards gradient magnitude — so skipping it is unambiguously correct. There is no equally clean "skip clip" for a magnitude-based optimizer (it legitimately uses clipping for spike protection); the optimizer-agnostic alternative is to clamp the clip coefficient so a benign-large grad_norm can't drive gradients under the eps floor (listed below).

The harmful path, concretely

ChainedOptimizer.step clips every non-stub sub-optimizer's main_params with the global grad_norm whenever optimizer.config.clip_grad > 0. Applied to a Muon sub-optimizer, the tiny c ≈ 2e-8 pushes its per-matrix gradients below Newton–Schulz's internal F.normalize floor, so the orthogonalization silently degenerates (near-zero, non-orthogonal update) and training stalls — with the forward/loss looking completely normal.

Why not a config-level fix

Setting clip_grad=0 only for the Muon sub-optimizer is not possible via config, because ChainedOptimizer requires all sub-optimizers to share the same OptimizerConfig (the DistOpt+LayerWise path even resets result.config = config to satisfy this). So this PR uses a per-optimizer attribute instead.

Change

  • ChainedOptimizer.step: skip the norm-clip for a sub-optimizer carrying skip_grad_norm_clip = True (defaults to False via getattr, so all existing optimizers are unaffected). grad_norm is still computed and returned for logging; only the clip is skipped, and only for the flagged (Muon) sub-optimizer — Adam sub-optimizers still clip normally.
  • get_megatron_optimizer / _build_emerging_optimizer: set skip_grad_norm_clip = True on the emerging (Muon) optimizer — both the standard-chain wrapper and the LayerWiseDistributedOptimizer (which owns the Muon-managed matrix params for dist_muon).

Testing

The runtime effect — not magnitude-clipping the Muon param group — was validated on the actual Qwen3.5-35B-A3B (GatedDeltaNet) model on 16× B200 across 2 nodes (2×8), optimizer=dist_muon:

  • Muon group's gradient clipping off (equivalent to this PR; in our run via clip_grad=0): a fixed 32-example batch overfits to ~0.02, real-data training learns.
  • Global clip on (clip_grad=1.0, current behavior): training stalls (loss flat ~0.5, grad_norm 7.5e7 → 2.2e11).
  • Dual single-variable control: keeping clip_grad=1.0 but only lowering Newton–Schulz eps 1e-7→1e-30 also unblocks learning, confirming the clip → NS-eps interaction.
  • Optimizer-agnostic confirmation (optimizer=adam, fixed 64-example overfit, 200 steps): clip_grad=0 → loss 0.69 → 0.029 (vs Muon clip_grad=0 → 0.019), while clip_grad=1.0 stalls at ~0.5. grad_norm stayed 1e8–3e12 throughout with no NaN — Adam's per-parameter second-moment normalization digests the raw gradients fine once they are not crushed by the clip. So the clip is the sole cause across optimizers; Adam is viable without it, and Muon is not required for correctness (just lower/steadier).

This PR reproduces that validated behavior via the skip_grad_norm_clip attribute instead of a global clip_grad=0; the exact attribute wiring should additionally be exercised by CI.

Alternatives

  1. A config/CLI flag (e.g. --no-clip-orthogonalizing-optimizers) instead of an implicit attribute.
  2. Detect orthogonalizing optimizers by type rather than an attribute.
  3. A more general fix that also helps magnitude-based optimizers: clamp the clip coefficient with a floor so a benign-but-large grad_norm can't drive per-matrix gradients under the optimizer's numerical eps. (Broader blast radius — affects all training — hence not chosen here.)
  4. Pair with the related robustness fix so the orthogonalizer doesn't silently degenerate on small inputs: newton_schulz silently degenerates (non-orthogonal output) for small-norm inputs due to F.normalize eps floor NVIDIA-NeMo/Emerging-Optimizers#229 / PR fix(muon_utils): keep newton_schulz scale-invariant for small-norm inputs (#229) NVIDIA-NeMo/Emerging-Optimizers#230.

Refs #5394

…timizers

ChainedOptimizer.step magnitude-clips every sub-optimizer with config.clip_grad>0,
including Muon. Muon's update is scale-invariant (Newton-Schulz discards gradient
magnitude), so the clip is a no-op at best; at worst, a large global grad_norm makes
the clip coefficient tiny and pushes per-matrix gradients below Newton-Schulz's
normalization floor, silently degenerating the orthogonalization and stalling training.

A config-level clip_grad=0 for the Muon group is not possible because ChainedOptimizer
requires all sub-optimizers to share one OptimizerConfig. Use a per-optimizer
'skip_grad_norm_clip' attribute (default False via getattr) instead, set on the
emerging (Muon) optimizer and the LayerWiseDistributedOptimizer that owns its params.

Refs NVIDIA#5394

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ShauryaaSharma ShauryaaSharma left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request Changes — the fix is directionally correct but has two blocking issues.

Bug 1 (Critical): The skip_grad_norm_clip flag is set on the layer_wise_optimizer
container, but when results is empty the training loop calls .step() directly on it.
The inherited ChainedOptimizer.step() then iterates the inner per-layer sub-optimizers —
none of which carry the flag — so clipping still fires. The fix is entirely bypassed for
this code path.

Bug 2 (High): The flag is applied unconditionally to every optimizer routed through
_get_megatron_emerging_optimizer, including SOAP and Lion. These are not scale-invariant
and should remain subject to clipping.

Two medium/low findings are also included as inline comments (grad_norm_skip_threshold
pollution and the should_clip wasted AllReduce).

Comment thread megatron/core/optimizer/__init__.py Outdated
Comment thread megatron/core/optimizer/__init__.py Outdated
Comment thread megatron/core/optimizer/optimizer.py Outdated
Comment thread megatron/core/optimizer/optimizer.py
…ip-threshold norm

Addresses @ShauryaaSharma's review on NVIDIA#5395:
- Critical: set skip_grad_norm_clip on the raw orthogonalizing sub-optimizers and
  re-propagate onto LayerWiseDistributedOptimizer.chained_optimizers after construction
  (LayerWise re-wraps base opts in Float16OptimizerWithFloat16Params under bf16; wrappers
  don't forward attribute access, so the direct path needs the flag on the actual members).
- High: gate the flag on isinstance(opt, OrthogonalizedOptimizer) so SOAP/Lion still clip;
  container flag now set only when every base sub is orthogonalizing.
- Medium: _get_grad_norm_skip_threshold() excludes skip-flagged subs' grads from the norm
  used for grad_norm_skip_threshold (Muon's huge grad no longer trips skip for Adam subs);
  identical to get_grad_norm() when nothing is flagged. Scoped to the threshold check.
- Low: should_clip predicate now also checks skip_grad_norm_clip (no wasted AllReduce).

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 18, 2026
…_skip_threshold

The review-revision's _get_grad_norm_skip_threshold() unconditionally called
get_grad_stats_parallel_group(), which asserts all chained sub-optimizers share
one grad-stats group. On the distributed-optimizer path (Muon LayerWise sub +
Adam DistributedOptimizer sub) the groups differ, so it raised AssertionError
and broke every use_param_layout=True layer-wise test (regression vs base).

Now compute sharedness over the NON-SKIP subs only (sidestepping the container
assertion), return early for the all-skip (Muon-only -> 0.0) and single-non-skip
cases, and fall back to combining per-sub norms when groups are non-shared --
mirroring get_grad_norm()'s own shared/non-shared handling. Found by on-node
8xH100 regression testing.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
…topt path)

Covers the four review fixes and the distributed-optimizer regression:
- B1: skip flag on the bf16-wrapper members of LayerWiseDistributedOptimizer
  .chained_optimizers (Muon-only container flagged; Adam sub in a mix unflagged).
- B2: isinstance(OrthogonalizedOptimizer) gate (muon flagged, lion not).
- B3: _get_grad_norm_skip_threshold() excludes flagged subs' grads; Adam update
  not wrongly skipped by a Muon sub's huge grad.
- B4: should_clip short-circuits a Muon-only chain.
- distopt path: get_megatron_optimizer with Muon LayerWise + Adam DistributedOptimizer
  steps without the shared-grad-stats-group assertion.

Verified on an 8-GPU (sm90) node, torchrun nproc=2: 8 passed.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
@yuchenwang3

Copy link
Copy Markdown
Author

Thanks for the thorough review, @ShauryaaSharma — all four findings are addressed, and I verified the result on an 8-GPU (sm90/H100-class) node.

1 (Critical). The flag is now set on the raw orthogonalizing sub before it's appended to the LayerWise base list, and re-propagated onto layer_wise_optimizer.chained_optimizers after construction. The second step is required: LayerWiseDistributedOptimizer re-wraps each base optimizer in Float16OptimizerWithFloat16Params under bf16, and the wrappers don't forward attribute access, so the inherited ChainedOptimizer.step() (the results-empty direct path) reads the flag off the wrapper. Verified: a Muon-only bf16 build returns a bare LayerWiseDistributedOptimizer whose every chained member carries the flag; a Muon+Adam build leaves the Adam sub unflagged.

2 (High). Gated on isinstance(optimizer, OrthogonalizedOptimizer), computed on the raw optimizer. On emerging_optimizers 0.4.0a0 this hits exactly the Muon family (Muon, AdaptiveMuon, Scion, Spel, MOP, PolarGrad, MuonHyperball, SinkhornMuon); SOAP and Lion are not subclasses and keep clipping. Container flag set only when all base subs are orthogonalizing. Added a guarded OrthogonalizedOptimizer import with an empty-tuple sentinel so isinstance is safely False if the package is absent.

3 (Medium). Added _get_grad_norm_skip_threshold(), which excludes skip-flagged subs' grads from the norm used for the grad_norm_skip_threshold comparison (and exempts skip-flagged subs from the check). It returns get_grad_norm() unchanged when nothing is flagged, so no behavior change for non-Muon users. Scoped to the threshold check (the clip already short-circuits for skip-flagged subs). Verified: a Muon sub at grad_norm ≈ 6.2e6 + a well-behaved Adam sub → get_grad_norm()=6.2e6 vs _get_grad_norm_skip_threshold()=4.0e-4, so the Adam update is no longer wrongly skipped.

4 (Low). should_clip now also checks skip_grad_norm_clip, so a Muon-only chain no longer triggers _compute_grad_norms_by_group() (verified: 0 calls for a Muon-only chain, ≥1 for an Adam chain).

One issue I caught and fixed while testing (flagging for transparency). My first cut of _get_grad_norm_skip_threshold() called get_grad_stats_parallel_group() unconditionally, which asserts all chained subs share one grad-stats group. On the distributed-optimizer path (Muon LayerWiseDistributedOptimizer + Adam DistributedOptimizer) the groups differ, so it raised AssertionError and broke the use_param_layout=True layer-wise tests. The latest commit restores get_grad_norm()'s shared/non-shared handling — sharedness is judged over the non-skip subs only (sidestepping the container assertion), with a per-sub-norm fallback when non-shared and an early return for the Muon-only case. Those [True] tests are green again.

Verification (8-GPU node, torchrun, vs base ad5a93b). No new failures vs base: test_layer_wise_optimizer.py is back to 41/0 (the 15 [True] failures cleared), optimizer/ and test_lion_optimizer.py unchanged. (There are 27 pre-existing adaptive_muon/soap failures from an emerging_optimizers 0.4.0a0 API skew — identical on base and branch, unrelated to this PR.) I also added tests/unit_tests/optimizer/test_skip_grad_norm_clip.py covering the four behaviors above plus the distributed-optimizer path (8/8 pass).

The clip total_norm for non-skip subs still uses the full grad_norm — I scoped finding 3 to the threshold as you suggested; happy to narrow the clip norm too in a follow-up if you'd prefer. Ready for another look.

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 19, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 21, 2026
@yuchenwang3

Copy link
Copy Markdown
Author

Thanks again for the review and approval, @ShauryaaSharma 🙏 This still shows REVIEW_REQUIRED and the copy-pr-bot CI hasn't been triggered yet, so it can't merge as-is. Anything needed from my side — happy to rebase onto latest main? If it's just the CI / CODEOWNER step, could you /ok to test or loop in someone from the optimizer CODEOWNERS to help land it? Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants