fix(muon_utils): keep newton_schulz scale-invariant for small-norm inputs (#229) by yuchenwang3 · Pull Request #230 · NVIDIA-NeMo/Emerging-Optimizers

yuchenwang3 · 2026-06-17T20:25:37Z

What

newton_schulz is meant to be scale-invariant — the orthogonalization of x and c*x should be identical. It is not: the internal F.normalize(x, p=2, dim=(-2, -1), eps=1e-7) divides a small-norm input by eps instead of its norm once ||x||_F < eps, so ||X||_F = ||x||_F / eps << 1 and the Newton–Schulz iteration (tuned for singular values ~1) cannot lift it → a silently degenerate, non-orthogonal output (no warning, no error).
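A minimal sketch of the normalization floor described above, in pure Python rather than PyTorch so it runs anywhere. It assumes only the rule `x / max(||x||_F, eps)` quoted for the non-TP path; the helper names and the 2x2 `base` matrix are hypothetical.

```python
import math


def fro_norm(x):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in x for v in row))


def normalize_fro(x, eps):
    """Divide by max(||x||_F, eps), mirroring the F.normalize-style prelude."""
    denom = max(fro_norm(x), eps)
    return [[v / denom for v in row] for row in x]


base = [[3.0, 4.0], [0.0, 0.0]]  # hypothetical input, ||base||_F = 5


def normalized_norm(scale, eps):
    """Frobenius norm of the normalized version of scale * base."""
    scaled = [[scale * v for v in row] for row in base]
    return fro_norm(normalize_fro(scaled, eps))


for scale in (1e-2, 1e-6, 1e-9, 1e-12):
    old = normalized_norm(scale, eps=1e-7)
    new = normalized_norm(scale, eps=1e-15)
    print(f"scale={scale:g}  eps=1e-7 -> {old:.3g}  eps=1e-15 -> {new:.3g}")
```

Once `scale * ||base||_F` drops below `eps`, the normalized norm is `||x||_F / eps` instead of 1, which is the input the iteration then has to work with.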

Fix

Lower the eps default 1e-7 → 1e-30 so it acts purely as a divide-by-zero guard, which is its documented purpose (eps: Small constant to avoid division by zero.). Input is enforced float32, so 1e-30 is a safe guard. This covers all paths, since they share this eps:

non-TP: torch.nn.functional.normalize(..., eps=eps)
TP: distributed_normalize_p2(x, eps, tp_group) (x / sqrt(sum).clamp_min(eps))
newton_schulz_tp routes through newton_schulz, so it inherits the fix.

Before / after (standalone, CPU, mirrors the `F.normalize(..., eps=1e-7)` prelude)

eps=1e-7 :  ||in||_F=1e-9 -> ||out||_F=16.1 ,  1e-10 -> 1.67    (ideal ~45)   <- DEGENERATE
eps=1e-30:  ||out||_F stays ~45 across all input scales                       <- scale-invariant
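The effect can be sketched without the library. The block below is not the standalone repro behind the numbers above, and it does not use the library's coefficients: it swaps in the classic cubic Newton–Schulz step `X <- 1.5 X - 0.5 X Xᵀ X` with a hypothetical fixed budget of 5 steps. The factor `0.01` plays the role of `||x||_F / eps` for `||x||_F = 1e-9` and `eps = 1e-7`.

```python
import math


def matmul(a, b):
    """Plain list-of-rows matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cubic_newton_schulz(x, steps=5):
    """Classic cubic iteration (an assumption, not the library's polynomial)."""
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


def orthogonality_error(x):
    """||X^T X - I||_F; close to 0 for an orthogonal result."""
    g = matmul(transpose(x), x)
    n = len(g)
    return math.sqrt(sum((g[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))


m = [[2.0, 1.0], [0.0, 1.0]]  # hypothetical full-rank input
norm = math.sqrt(sum(v * v for row in m for v in row))
unit = [[v / norm for v in row] for row in m]      # properly normalized
floored = [[0.01 * v for v in row] for row in unit]  # divided by eps instead

err_unit = orthogonality_error(cubic_newton_schulz(unit))
err_floored = orthogonality_error(cubic_newton_schulz(floored))
print(f"normalized input: error={err_unit:.2e}")
print(f"floored input:    error={err_floored:.2e}")
```

With a fixed step budget, the properly normalized input converges to a near-orthogonal matrix while the floored input stays far from one, and nothing in the iteration itself signals the difference.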

Test

Adds test_newtonschulz_scale_invariance (asserts the orthogonalized output is identical for x and scale * x, scale ∈ {1e-2, 1e-6, 1e-9, 1e-12}). It fails on the old eps=1e-7 default and passes with the fix.

Alternatives (happy to switch to maintainers' preference)

eps = torch.finfo(x.dtype).tiny (dtype-relative guard).
Warn / raise when ||x||_F < eps instead of changing the default (keeps current behavior but stops the silent failure).
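For comparison, the two alternatives might look roughly like this. It is a pure-Python sketch with a hypothetical helper `guarded_norm`, not code from the PR; `FP32_TINY` stands in for `torch.finfo(torch.float32).tiny`, the smallest positive normal float32 value.

```python
import warnings

FP32_TINY = 2.0 ** -126  # smallest positive normal float32, about 1.18e-38


def guarded_norm(norm, eps, warn=True):
    """Return the denominator used for normalization.

    If the true norm is below eps, the guard value is used instead; with
    warn=True this is reported rather than happening silently.
    """
    if norm < eps:
        if warn:
            warnings.warn(
                f"input norm {norm:g} is below eps {eps:g}; "
                "normalization will not produce a unit-norm matrix"
            )
        return eps
    return norm


# Alternative 1: dtype-relative guard, small enough to act only as a zero guard.
print(guarded_norm(1e-9, eps=FP32_TINY))

# Alternative 2: keep the old default but surface the problem.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(guarded_norm(1e-9, eps=1e-7), len(caught))
```

The first option changes behavior for every small-norm input; the second keeps the old numerics and only removes the silence.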

Note

Could not run the test suite locally (macOS arm64 has no triton wheel, so the package fails to import); relying on CI. The standalone repro above was verified on CPU.

Why it matters

Combined with a training framework that applies gradient-norm clipping to the Muon param group (e.g. Megatron-LM ChainedOptimizer), the clip coefficient can scale per-matrix gradients below this floor → Newton–Schulz silently emits degenerate updates → training stalls while the forward/loss look completely normal, which is very hard to diagnose.

Fixes #229

…puts `newton_schulz` should be scale-invariant, but the internal `F.normalize(x, p=2, dim=(-2,-1), eps=1e-7)` divides a small-norm input by `eps` instead of its norm once `||x||_F < eps`, so the iteration (tuned for singular values ~1) cannot lift it and the output silently degenerates to a non-orthogonal matrix. Lower the `eps` default 1e-7 -> 1e-30 so it acts purely as a divide-by-zero guard (its documented purpose). Input is enforced fp32, so 1e-30 is safe. This covers the non-TP `F.normalize`, the TP `distributed_normalize_p2`, and `newton_schulz_tp` (which routes through `newton_schulz`). Add `test_newtonschulz_scale_invariance` regression test. Fixes NVIDIA-NeMo#229 Signed-off-by: yuchenwang3 <eang333cms@gmail.com>

copy-pr-bot · 2026-06-17T20:25:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-17T20:28:40Z

Greptile Summary

This PR fixes a silent scale-invariance bug in newton_schulz where the internal F.normalize eps default (1e-7) would act as a divisor rather than a zero-guard for small-norm inputs, producing degenerate non-orthogonal outputs without any error or warning.

Core fix: lowers the eps default from 1e-7 to 1e-15 in newton_schulz (and, through shared code, in the TP path via distributed_normalize_p2), covering the most common failure modes from gradient-norm clipping interactions.
Regression test: adds test_newtonschulz_small_eps which verifies that newton_schulz(scale * x) equals newton_schulz(x) for scales down to 1e-12, correctly failing on the old default and passing with the fix.

Confidence Score: 5/5

Safe to merge; the one-line default change and accompanying test correctly address the described silent failure.

The change is minimal and targeted: a single default value is lowered and a focused regression test is added. The fix is correct — the new eps=1e-15 ensures F.normalize and distributed_normalize_p2 never override a valid small-norm input with the guard value for any input reachable in float32 practice. The two observations flagged are non-blocking: a discrepancy between the PR description's stated target (1e-30) and the implemented value (1e-15), and a comment that references squaring eps when no code path does so.

The default value in muon_utils.py is worth a second look against the PR description's stated target of 1e-30.

Important Files Changed

Filename	Overview
emerging_optimizers/orthogonalized_optimizers/muon_utils.py	Default `eps` lowered from `1e-7` to `1e-15` to fix silent degenerate output for small-norm inputs; a clarifying comment was added. The PR description targets `1e-30` but the implementation uses `1e-15`, leaving a narrow gap where the original bug could still manifest.
tests/test_muon_utils.py	Adds `test_newtonschulz_small_eps` with four scale factors down to `1e-12` to regression-test scale invariance. Test design is sound and correctly catches the old `1e-7` bug.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["newton_schulz(x, eps)"] --> B{tp_group?}
    B -- yes --> C["distributed_normalize_p2(x, eps, tp_group)\nx / sqrt(x_sq_sum).clamp_min(eps)"]
    B -- no --> D["F.normalize(x, p=2, dim=(-2,-1), eps=eps)\nx / max(||x||_F, eps)"]
    C --> E["Newton–Schulz iterations\n(singular values tuned for ~1)"]
    D --> E
    E --> F["Orthogonalized output X"]

Reviews (3): Last reviewed commit: "address review: trim NOTE to a #229 refe..."

yuchenwang3 · 2026-06-17T20:58:10Z

Background / motivation: surfaced while running ms-swift + Megatron-Core SFT of Qwen3.5-35B-A3B (GatedDeltaNet hybrid, optimizer=dist_muon) on 16× B200 across 2 nodes (2×8). A benign-but-large global grad_norm made the framework's gradient-clip coefficient tiny (≈2e-8), which pushed Muon's per-matrix gradients below this F.normalize(eps=1e-7) floor → newton_schulz silently returned a degenerate, non-orthogonal update → training stalled with forward/loss looking normal. The hard-to-diagnose part was exactly the silent degeneration, hence this scale-robustness fix. Megatron-side counterpart (the clip that produces the tiny coefficient): NVIDIA/Megatron-LM#5394 and PR NVIDIA/Megatron-LM#5395.
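The arithmetic behind this can be sketched as follows. The clip formula is the generic global-norm one (in the style of `torch.nn.utils.clip_grad_norm_`); the exact Megatron-Core formula is not shown in this thread, and the `total_norm` and per-matrix norm values are hypothetical, chosen only to give a coefficient near the reported 2e-8.

```python
OLD_EPS = 1e-7
NEW_EPS = 1e-15


def clip_coefficient(total_norm, max_norm=1.0, tiny=1e-6):
    """Generic global-norm clip factor, capped at 1 (assumed formula)."""
    return min(1.0, max_norm / (total_norm + tiny))


def normalized_norm(grad_norm, eps):
    """Norm of the matrix after dividing by max(||x||_F, eps)."""
    return grad_norm / max(grad_norm, eps)


total_norm = 5.0e7       # hypothetical large global grad norm
per_matrix_norm = 1.0    # hypothetical per-matrix gradient norm before clipping

coef = clip_coefficient(total_norm)
clipped_norm = coef * per_matrix_norm

print(f"clip coefficient       = {coef:.3g}")
print(f"clipped per-matrix norm = {clipped_norm:.3g}")
print(f"after normalize, eps=1e-7  -> {normalized_norm(clipped_norm, OLD_EPS):.3g}")
print(f"after normalize, eps=1e-15 -> {normalized_norm(clipped_norm, NEW_EPS):.3g}")
```

Under these assumptions the clipped gradient lands below the old `1e-7` floor, so the iteration receives a matrix of norm about 0.2 instead of 1; with `1e-15` it is normalized by its true norm.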

@skyw

@skyw is right that 1e-30 underflows in fp32 when squared (1e-30**2 = 1e-60). 1e-15 keeps eps**2 representable (1e-15**2 = 1e-30, a normal fp32 value) while staying far below any realistic ||x||_F, so small-norm inputs still normalize by their true norm instead of the guard. Within the 1e-12..1e-15 range you suggested; chose 1e-15 for maximum dynamic range below the floor. See NVIDIA-NeMo#229. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
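The fp32 arithmetic in this message can be checked with the standard library alone, by rounding through a 4-byte float. This only illustrates the representability claim; whether any code path actually squares `eps` is a separate question (the Greptile summary above notes that none does). The helper `to_fp32` is hypothetical.

```python
import struct

FP32_TINY = 2.0 ** -126  # smallest positive normal float32, about 1.18e-38


def to_fp32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


for eps in (1e-30, 1e-15):
    eps32 = to_fp32(eps)
    squared32 = to_fp32(eps32 * eps32)
    print(f"eps={eps:g}: normal in fp32={eps32 >= FP32_TINY}, "
          f"eps**2 in fp32={squared32:g}")
```

`1e-30` is itself a normal float32 value, but its square is far below the smallest subnormal and rounds to zero, whereas `1e-15` squared stays in the normal range.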

skyw

eps change LGTM. rest is not necessary.

skyw · 2026-06-22T15:58:59Z

+    # ``||x||_F`` (input is fp32). If it is too large, a small-norm input is divided by ``eps``
+    # instead of its norm, so ``||X||_F = ||x||_F / eps << 1`` and the iteration (tuned for
+    # singular values ~1) cannot lift it -> silently degenerate, non-orthogonal output. This
+    # breaks the scale-invariance of orthogonalization. It must also stay above ~1e-15 so that


Statement is too strong. A reference to #229 should be sufficient.

skyw · 2026-06-22T16:00:34Z

        )

+    @parameterized.parameters(1e-2, 1e-6, 1e-9, 1e-12)
+    def test_newtonschulz_scale_invariance(self, scale):


Test name is not quite right. I'd use a more explicit name, like test small eps, etc.

skyw · 2026-06-22T16:02:01Z

/ok to test dcf8361

@skyw

… to test_newtonschulz_small_eps Per @skyw review on NVIDIA-NeMo#230: keep the fp32-safe eps=1e-15 change, shorten the over-long NOTE to a brief NVIDIA-NeMo#229 reference, and give the regression test a more explicit name. Signed-off-by: Yuchen Wang <yw.yy953e@alibaba-inc.com>

yuchenwang3 · 2026-06-22T17:04:15Z

Done in 620196c — both addressed:

Trimmed the NOTE down to a one-line See issue #229 reference.
Renamed the test to test_newtonschulz_small_eps.

eps=1e-15 kept as you OK'd. Ready to merge whenever the re-run is green.

skyw · 2026-06-22T17:06:09Z

/ok to test 620196c

This was referenced Jun 17, 2026

[BUG] ChainedOptimizer applies global grad-norm clipping to Muon (orthogonalizing) param groups, silently stalling training NVIDIA/Megatron-LM#5394

Open

fix(optimizer): skip grad-norm clipping for orthogonalizing (Muon) optimizers NVIDIA/Megatron-LM#5395

Open

skyw added the enhancement New feature or request label Jun 22, 2026

yuchenwang3 mentioned this pull request Jun 22, 2026

newton_schulz silently degenerates (non-orthogonal output) for small-norm inputs due to F.normalize eps floor #229

Closed

skyw requested changes Jun 22, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to test June 22, 2026 16:02 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 16:02 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 16:05 Inactive

skyw approved these changes Jun 22, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to test June 22, 2026 17:06 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 17:06 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 17:07 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 17:10 Inactive

skyw enabled auto-merge (squash) June 22, 2026 17:10

skyw merged commit 7ebd3cf into NVIDIA-NeMo:main Jun 22, 2026
25 checks passed

skyw merged 3 commits into NVIDIA-NeMo:main from yuchenwang3:fix/newton-schulz-eps-scale-invariance
