
fix(muon_utils): keep newton_schulz scale-invariant for small-norm inputs (#229) #230

Merged
skyw merged 3 commits into NVIDIA-NeMo:main from yuchenwang3:fix/newton-schulz-eps-scale-invariance
Jun 22, 2026

Conversation

@yuchenwang3

Contributor

What

newton_schulz is meant to be scale-invariant — the orthogonalization of x and c*x should be identical. It is not: the internal F.normalize(x, p=2, dim=(-2, -1), eps=1e-7) divides a small-norm input by eps instead of its norm once ||x||_F < eps, so ||X||_F = ||x||_F / eps << 1 and the Newton–Schulz iteration (tuned for singular values ~1) cannot lift it → a silently degenerate, non-orthogonal output (no warning, no error).
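The floor behavior described above can be reproduced without PyTorch. The following is a minimal sketch in plain Python, assuming the documented `F.normalize` semantics `x / max(||x||_F, eps)`. The helpers `normalize_fro` and `fro_norm` are hypothetical names used only for illustration; they are not part of the library.

```python
import math

def fro_norm(x):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in x for v in row))

def normalize_fro(x, eps):
    """Mimic x / max(||x||_F, eps) for a list-of-rows matrix."""
    denom = max(fro_norm(x), eps)
    return [[v / denom for v in row] for row in x]

base = [[3.0, 4.0], [0.0, 0.0]]  # ||base||_F = 5
for scale in (1.0, 1e-6, 1e-9, 1e-12):
    x = [[scale * v for v in row] for row in base]
    old = fro_norm(normalize_fro(x, 1e-7))   # norm after normalizing with eps=1e-7
    new = fro_norm(normalize_fro(x, 1e-15))  # norm after normalizing with eps=1e-15
    print(f"scale={scale:g}  eps=1e-7 -> {old:.3g}  eps=1e-15 -> {new:.3g}")
```

Once the input norm drops below `eps`, the normalized matrix no longer has unit norm, which is the precondition the iteration relies on.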

Fix

Lower the eps default 1e-7 → 1e-30 so it acts purely as a divide-by-zero guard, which is its documented purpose (eps: Small constant to avoid division by zero.). Input is enforced float32, so 1e-30 is a safe guard. This covers all paths, since they share this eps:

  • non-TP: torch.nn.functional.normalize(..., eps=eps)
  • TP: distributed_normalize_p2(x, eps, tp_group) (x / sqrt(sum).clamp_min(eps))
  • newton_schulz_tp routes through newton_schulz, so it inherits the fix.

Before / after (standalone, CPU, mirrors the F.normalize(..., eps=1e-7) prelude)

eps=1e-7 :  ||in||_F=1e-9 -> ||out||_F=16.1 ,  1e-10 -> 1.67    (ideal ~45)   <- DEGENERATE
eps=1e-30:  ||out||_F stays ~45 across all input scales                       <- scale-invariant
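To see why a fixed number of steps cannot recover from a badly under-scaled input, it may help to look at a single singular value. The sketch below uses the textbook cubic Newton–Schulz update, not the tuned coefficients in `muon_utils` (which this page does not show), so the numbers will not match the repro output above. `newton_schulz_sigma` is a hypothetical helper, and the step count of 5 is an assumption.

```python
def newton_schulz_sigma(sigma, steps=5):
    """Apply the classical cubic Newton-Schulz map to one singular value.

    For X = U diag(s) V^T, the update X <- 1.5*X - 0.5*X @ X.T @ X acts on
    each singular value independently: s <- 1.5*s - 0.5*s**3.
    """
    for _ in range(steps):
        sigma = 1.5 * sigma - 0.5 * sigma ** 3
    return sigma

# Singular values near 1 converge to 1; tiny ones grow by only ~1.5x per step.
for sigma0 in (0.9, 0.5, 0.05, 0.005):
    print(f"start={sigma0:<6} after 5 steps={newton_schulz_sigma(sigma0):.4f}")
```

With this iteration a small singular value grows by roughly a factor of 1.5 per step, so a handful of steps is not enough to reach 1 when normalization has left the matrix far below unit scale.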

Test

Adds test_newtonschulz_scale_invariance (asserts the orthogonalized output is identical for x and scale * x, scale ∈ {1e-2, 1e-6, 1e-9, 1e-12}). It fails on the old eps=1e-7 default and passes with the fix.

Alternatives (happy to switch to maintainers' preference)

  1. eps = torch.finfo(x.dtype).tiny (dtype-relative guard).
  2. Warn / raise when ||x||_F < eps instead of changing the default (keeps current behavior but stops the silent failure).
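Alternative 2 could look roughly like the following. This is only a sketch in plain Python of the warn-on-floor idea; `guarded_norm` is a hypothetical helper, and the warning text and category are assumptions rather than anything proposed in the PR.

```python
import math
import warnings

def guarded_norm(x, eps):
    """Return max(||x||_F, eps), warning when the eps floor takes over."""
    norm = math.sqrt(sum(v * v for row in x for v in row))
    if 0.0 < norm < eps:
        # The input is nonzero but will be divided by eps, not by its norm.
        warnings.warn(
            f"input norm {norm:.3e} is below eps={eps:.1e}; "
            "normalization will divide by eps instead of the norm",
            RuntimeWarning,
        )
    return max(norm, eps)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    guarded_norm([[1e-9, 0.0], [0.0, 1e-9]], eps=1e-7)
    print(len(caught), "warning(s) raised")
```

An exactly zero input is left silent here, since dividing by `eps` is the intended behavior in that case.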

Note

Could not run the test suite locally (macOS arm64 has no triton wheel, so the package fails to import); relying on CI. The standalone repro above was verified on CPU.

Why it matters

Combined with a training framework that applies gradient-norm clipping to the Muon param group (e.g. Megatron-LM ChainedOptimizer), the clip coefficient can scale per-matrix gradients below this floor → Newton–Schulz silently emits degenerate updates → training stalls while the forward/loss look completely normal, which is very hard to diagnose.

Fixes #229

…puts

`newton_schulz` should be scale-invariant, but the internal
`F.normalize(x, p=2, dim=(-2,-1), eps=1e-7)` divides a small-norm input by
`eps` instead of its norm once `||x||_F < eps`, so the iteration (tuned for
singular values ~1) cannot lift it and the output silently degenerates to a
non-orthogonal matrix.

Lower the `eps` default 1e-7 -> 1e-30 so it acts purely as a divide-by-zero
guard (its documented purpose). Input is enforced fp32, so 1e-30 is safe. This
covers the non-TP `F.normalize`, the TP `distributed_normalize_p2`, and
`newton_schulz_tp` (which routes through `newton_schulz`).

Add `test_newtonschulz_scale_invariance` regression test.

Fixes NVIDIA-NeMo#229

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

This PR fixes a silent scale-invariance bug in newton_schulz where the internal F.normalize eps default (1e-7) would act as a divisor rather than a zero-guard for small-norm inputs, producing degenerate non-orthogonal outputs without any error or warning.

  • Core fix: lowers the eps default from 1e-7 to 1e-15 in newton_schulz (and, through shared code, in the TP path via distributed_normalize_p2), covering the most common failure modes from gradient-norm clipping interactions.
  • Regression test: adds test_newtonschulz_small_eps which verifies that newton_schulz(scale * x) equals newton_schulz(x) for scales down to 1e-12, correctly failing on the old default and passing with the fix.

Confidence Score: 5/5

Safe to merge; the one-line default change and accompanying test correctly address the described silent failure.

The change is minimal and targeted: a single default value is lowered and a focused regression test is added. The fix is correct — the new eps=1e-15 ensures F.normalize and distributed_normalize_p2 never override a valid small-norm input with the guard value for any input reachable in float32 practice. The two observations flagged are non-blocking: a discrepancy between the PR description's stated target (1e-30) and the implemented value (1e-15), and a comment that references squaring eps when no code path does so.

The default value in muon_utils.py is worth a second look against the PR description's stated target of 1e-30.

Important Files Changed

| Filename | Overview |
| --- | --- |
| emerging_optimizers/orthogonalized_optimizers/muon_utils.py | Default eps lowered from 1e-7 to 1e-15 to fix silent degenerate output for small-norm inputs; a clarifying comment was added. The PR description targets 1e-30 but the implementation uses 1e-15, leaving a narrow gap where the original bug could still manifest. |
| tests/test_muon_utils.py | Adds test_newtonschulz_small_eps with four scale factors down to 1e-12 to regression-test scale invariance. Test design is sound and correctly catches the old 1e-7 bug. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["newton_schulz(x, eps)"] --> B{tp_group?}
    B -- yes --> C["distributed_normalize_p2(x, eps, tp_group)\nx / sqrt(x_sq_sum).clamp_min(eps)"]
    B -- no --> D["F.normalize(x, p=2, dim=(-2,-1), eps=eps)\nx / max(||x||_F, eps)"]
    C --> E["Newton–Schulz iterations\n(singular values tuned for ~1)"]
    D --> E
    E --> F["Orthogonalized output X"]

Reviews (3): Last reviewed commit: "address review: trim NOTE to a #229 refe..."

@yuchenwang3

yuchenwang3 commented Jun 17, 2026

Contributor Author

Background / motivation: surfaced while running ms-swift + Megatron-Core SFT of Qwen3.5-35B-A3B (GatedDeltaNet hybrid, optimizer=dist_muon) on 16× B200 across 2 nodes (2×8). A benign-but-large global grad_norm made the framework's gradient-clip coefficient tiny (≈2e-8), which pushed Muon's per-matrix gradients below this F.normalize(eps=1e-7) floor → newton_schulz silently returned a degenerate, non-orthogonal update → training stalled with forward/loss looking normal. The hard-to-diagnose part was exactly the silent degeneration, hence this scale-robustness fix. Megatron-side counterpart (the clip that produces the tiny coefficient): NVIDIA/Megatron-LM#5394 and PR NVIDIA/Megatron-LM#5395.
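The arithmetic behind this failure chain can be sketched in a few lines. The clipping formula below is the common global-norm form and is an assumption about the framework, not a quote of Megatron-LM code. The gradient norms are illustrative values picked to give a coefficient near the reported 2e-8, and `clip_coefficient` is a hypothetical helper.

```python
def clip_coefficient(total_norm, max_norm, tiny=1e-6):
    """Global-norm clipping factor: min(1, max_norm / (total_norm + tiny))."""
    return min(1.0, max_norm / (total_norm + tiny))

# Illustrative numbers only, chosen to give a coefficient near 2e-8.
coef = clip_coefficient(total_norm=5e7, max_norm=1.0)
matrix_grad_norm = 1.0                  # hypothetical per-matrix ||g||_F before clipping
clipped_norm = coef * matrix_grad_norm  # what newton_schulz receives
eps_old = 1e-7

below_floor = clipped_norm < eps_old
norm_after_normalize = clipped_norm / max(clipped_norm, eps_old)
print(f"coef={coef:.3g}  below_floor={below_floor}  "
      f"norm after normalize={norm_after_normalize:.3g}")
```

A per-matrix gradient with an ordinary norm therefore enters the iteration at a fraction of unit scale once the clip coefficient falls below `eps`.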

@skyw skyw added the enhancement New feature or request label Jun 22, 2026
@skyw is right that 1e-30 underflows in fp32 when squared (1e-30**2 = 1e-60).
1e-15 keeps eps**2 representable (1e-15**2 = 1e-30, a normal fp32 value) while
staying far below any realistic ||x||_F, so small-norm inputs still normalize by
their true norm instead of the guard. Within the 1e-12..1e-15 range you suggested;
chose 1e-15 for maximum dynamic range below the floor. See NVIDIA-NeMo#229.

Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
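The representability claim in this commit message can be checked with the standard library alone. The sketch below rounds values to IEEE-754 binary32 via `struct`; `to_f32` is a hypothetical helper. It illustrates only the arithmetic, and says nothing about whether any code path actually squares `eps` (the Greptile summary above notes that none does).

```python
import struct

def to_f32(value):
    """Round a Python float (fp64) to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal fp32 value

for eps in (1e-30, 1e-15):
    # Multiply two fp32 values and round the product back to fp32.
    squared = to_f32(to_f32(eps) * to_f32(eps))
    is_normal = squared >= F32_MIN_NORMAL
    print(f"eps={eps:g}  eps**2 in fp32={squared!r}  normal={is_normal}")
```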

@skyw skyw left a comment

Contributor


eps change LGTM. rest is not necessary.

# ``||x||_F`` (input is fp32). If it is too large, a small-norm input is divided by ``eps``
# instead of its norm, so ``||X||_F = ||x||_F / eps << 1`` and the iteration (tuned for
# singular values ~1) cannot lift it -> silently degenerate, non-orthogonal output. This
# breaks the scale-invariance of orthogonalization. It must also stay above ~1e-15 so that

Contributor


The statement is too strong. A reference to #229 should be sufficient.

Comment thread tests/test_muon_utils.py Outdated
)

@parameterized.parameters(1e-2, 1e-6, 1e-9, 1e-12)
def test_newtonschulz_scale_invariance(self, scale):

Contributor


The test name is not quite right. I'd use a more explicit name, like test small eps.

@skyw

skyw commented Jun 22, 2026

Contributor

/ok to test dcf8361

… to test_newtonschulz_small_eps

Per @skyw review on NVIDIA-NeMo#230: keep the fp32-safe eps=1e-15 change, shorten
the over-long NOTE to a brief NVIDIA-NeMo#229 reference, and give the regression
test a more explicit name.

Signed-off-by: Yuchen Wang <yw.yy953e@alibaba-inc.com>
@yuchenwang3

Contributor Author

Done in 620196c — both addressed:

  • Trimmed the NOTE down to a one-line See issue #229 reference.
  • Renamed the test to test_newtonschulz_small_eps.

eps=1e-15 kept as you OK'd. Ready to merge whenever the re-run is green.

@skyw

skyw commented Jun 22, 2026

Contributor

/ok to test 620196c

@skyw skyw enabled auto-merge (squash) June 22, 2026 17:10
@skyw skyw merged commit 7ebd3cf into NVIDIA-NeMo:main Jun 22, 2026
25 checks passed


Development

Successfully merging this pull request may close these issues.

newton_schulz silently degenerates (non-orthogonal output) for small-norm inputs due to F.normalize eps floor

2 participants