fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers#5395
fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers#5395yuchenwang3 wants to merge 4 commits into
Conversation
…timizers ChainedOptimizer.step magnitude-clips every sub-optimizer with config.clip_grad>0, including Muon. Muon's update is scale-invariant (Newton-Schulz discards gradient magnitude), so the clip is a no-op at best; at worst, a large global grad_norm makes the clip coefficient tiny and pushes per-matrix gradients below Newton-Schulz's normalization floor, silently degenerating the orthogonalization and stalling training. A config-level clip_grad=0 for the Muon group is not possible because ChainedOptimizer requires all sub-optimizers to share one OptimizerConfig. Use a per-optimizer 'skip_grad_norm_clip' attribute (default False via getattr) instead, set on the emerging (Muon) optimizer and the LayerWiseDistributedOptimizer that owns its params. Refs NVIDIA#5394 Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
ShauryaaSharma
left a comment
There was a problem hiding this comment.
Request Changes — the fix is directionally correct but has two blocking issues.
Bug 1 (Critical): The skip_grad_norm_clip flag is set on the layer_wise_optimizer
container, but when results is empty the training loop calls .step() directly on it.
The inherited ChainedOptimizer.step() then iterates the inner per-layer sub-optimizers —
none of which carry the flag — so clipping still fires. The fix is entirely bypassed for
this code path.
Bug 2 (High): The flag is applied unconditionally to every optimizer routed through
_get_megatron_emerging_optimizer, including SOAP and Lion. These are not scale-invariant
and should remain subject to clipping.
Two medium/low findings are also included as inline comments (grad_norm_skip_threshold
pollution and the should_clip wasted AllReduce).
…ip-threshold norm Addresses @ShauryaaSharma's review on NVIDIA#5395: - Critical: set skip_grad_norm_clip on the raw orthogonalizing sub-optimizers and re-propagate onto LayerWiseDistributedOptimizer.chained_optimizers after construction (LayerWise re-wraps base opts in Float16OptimizerWithFloat16Params under bf16; wrappers don't forward attribute access, so the direct path needs the flag on the actual members). - High: gate the flag on isinstance(opt, OrthogonalizedOptimizer) so SOAP/Lion still clip; container flag now set only when every base sub is orthogonalizing. - Medium: _get_grad_norm_skip_threshold() excludes skip-flagged subs' grads from the norm used for grad_norm_skip_threshold (Muon's huge grad no longer trips skip for Adam subs); identical to get_grad_norm() when nothing is flagged. Scoped to the threshold check. - Low: should_clip predicate now also checks skip_grad_norm_clip (no wasted AllReduce). Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
…_skip_threshold The review-revision's _get_grad_norm_skip_threshold() unconditionally called get_grad_stats_parallel_group(), which asserts all chained sub-optimizers share one grad-stats group. On the distributed-optimizer path (Muon LayerWise sub + Adam DistributedOptimizer sub) the groups differ, so it raised AssertionError and broke every use_param_layout=True layer-wise test (regression vs base). Now compute sharedness over the NON-SKIP subs only (sidestepping the container assertion), return early for the all-skip (Muon-only -> 0.0) and single-non-skip cases, and fall back to combining per-sub norms when groups are non-shared -- mirroring get_grad_norm()'s own shared/non-shared handling. Found by on-node 8xH100 regression testing. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
…topt path) Covers the four review fixes and the distributed-optimizer regression: - B1: skip flag on the bf16-wrapper members of LayerWiseDistributedOptimizer .chained_optimizers (Muon-only container flagged; Adam sub in a mix unflagged). - B2: isinstance(OrthogonalizedOptimizer) gate (muon flagged, lion not). - B3: _get_grad_norm_skip_threshold() excludes flagged subs' grads; Adam update not wrongly skipped by a Muon sub's huge grad. - B4: should_clip short-circuits a Muon-only chain. - distopt path: get_megatron_optimizer with Muon LayerWise + Adam DistributedOptimizer steps without the shared-grad-stats-group assertion. Verified on an 8-GPU (sm90) node, torchrun nproc=2: 8 passed. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
|
Thanks for the thorough review, @ShauryaaSharma — all four findings are addressed, and I verified the result on an 8-GPU (sm90/H100-class) node. 1 (Critical). The flag is now set on the raw orthogonalizing sub before it's appended to the LayerWise base list, and re-propagated onto 2 (High). Gated on 3 (Medium). Added 4 (Low). One issue I caught and fixed while testing (flagging for transparency). My first cut of Verification (8-GPU node, torchrun, vs base The clip |
|
Thanks again for the review and approval, @ShauryaaSharma 🙏 This still shows |
Background. Encountered while running ms-swift + Megatron-Core SFT of Qwen3.5-35B-A3B (GatedDeltaNet hybrid,
optimizer=dist_muon) on 16× B200 across 2 nodes (2×8). Full root-cause write-up in #5394.Scope & root cause (optimizer-agnostic)
The underlying failure is not Muon-specific. A benign-but-large global
grad_norm(here ~5e7 from the GatedDeltaNet layers, growing to ~2e11) makes the clip coefficientc = clip_grad / grad_norm ≈ 2e-8, which scales per-matrix gradients below the optimizer's numerical floor and freezes updates. The same condition stalls AdamW too (itsepsfloor) — i.e. the clip, not the optimizer, is the root cause. This is now confirmed directly (see Testing): AdamW withclip_grad=1.0stalls at ~0.5, while AdamW withclip_grad=0overfits a fixed batch to 0.029 (vs Muon's 0.019). (We useddist_muonhere following the model's training recipe; it is not required — it is simply naturally immune to this clip, since magnitude clipping is a no-op for an orthogonalizing update.)This PR fixes the case that has a clean, side-effect-free fix: for an orthogonalizing optimizer, magnitude clipping is semantically meaningless — Newton–Schulz discards gradient magnitude — so skipping it is unambiguously correct. There is no equally clean "skip clip" for a magnitude-based optimizer (it legitimately uses clipping for spike protection); the optimizer-agnostic alternative is to clamp the clip coefficient so a benign-large
grad_normcan't drive gradients under the eps floor (listed below).The harmful path, concretely
ChainedOptimizer.stepclips every non-stub sub-optimizer'smain_paramswith the globalgrad_normwheneveroptimizer.config.clip_grad > 0. Applied to a Muon sub-optimizer, the tinyc ≈ 2e-8pushes its per-matrix gradients below Newton–Schulz's internalF.normalizefloor, so the orthogonalization silently degenerates (near-zero, non-orthogonal update) and training stalls — with the forward/loss looking completely normal.Why not a config-level fix
Setting
clip_grad=0only for the Muon sub-optimizer is not possible viaconfig, becauseChainedOptimizerrequires all sub-optimizers to share the sameOptimizerConfig(the DistOpt+LayerWise path even resetsresult.config = configto satisfy this). So this PR uses a per-optimizer attribute instead.Change
ChainedOptimizer.step: skip the norm-clip for a sub-optimizer carryingskip_grad_norm_clip = True(defaults toFalseviagetattr, so all existing optimizers are unaffected).grad_normis still computed and returned for logging; only the clip is skipped, and only for the flagged (Muon) sub-optimizer — Adam sub-optimizers still clip normally.get_megatron_optimizer/_build_emerging_optimizer: setskip_grad_norm_clip = Trueon the emerging (Muon) optimizer — both the standard-chain wrapper and theLayerWiseDistributedOptimizer(which owns the Muon-managed matrix params fordist_muon).Testing
The runtime effect — not magnitude-clipping the Muon param group — was validated on the actual Qwen3.5-35B-A3B (GatedDeltaNet) model on 16× B200 across 2 nodes (2×8),
optimizer=dist_muon:clip_grad=0): a fixed 32-example batch overfits to ~0.02, real-data training learns.clip_grad=1.0, current behavior): training stalls (loss flat ~0.5,grad_norm7.5e7 → 2.2e11).clip_grad=1.0but only lowering Newton–Schulzeps1e-7→1e-30 also unblocks learning, confirming theclip → NS-epsinteraction.optimizer=adam, fixed 64-example overfit, 200 steps):clip_grad=0→ loss0.69 → 0.029(vs Muonclip_grad=0→ 0.019), whileclip_grad=1.0stalls at ~0.5.grad_normstayed 1e8–3e12 throughout with no NaN — Adam's per-parameter second-moment normalization digests the raw gradients fine once they are not crushed by the clip. So the clip is the sole cause across optimizers; Adam is viable without it, and Muon is not required for correctness (just lower/steadier).This PR reproduces that validated behavior via the
skip_grad_norm_clipattribute instead of a globalclip_grad=0; the exact attribute wiring should additionally be exercised by CI.Alternatives
--no-clip-orthogonalizing-optimizers) instead of an implicit attribute.grad_normcan't drive per-matrix gradients under the optimizer's numerical eps. (Broader blast radius — affects all training — hence not chosen here.)Refs #5394