fix(muon_utils): keep newton_schulz scale-invariant for small-norm inputs (#229)#230
Conversation
…puts `newton_schulz` should be scale-invariant, but the internal `F.normalize(x, p=2, dim=(-2,-1), eps=1e-7)` divides a small-norm input by `eps` instead of its norm once `||x||_F < eps`, so the iteration (tuned for singular values ~1) cannot lift it and the output silently degenerates to a non-orthogonal matrix. Lower the `eps` default 1e-7 -> 1e-30 so it acts purely as a divide-by-zero guard (its documented purpose). Input is enforced fp32, so 1e-30 is safe. This covers the non-TP `F.normalize`, the TP `distributed_normalize_p2`, and `newton_schulz_tp` (which routes through `newton_schulz`). Add `test_newtonschulz_scale_invariance` regression test. Fixes NVIDIA-NeMo#229 Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
Greptile SummaryThis PR fixes a silent scale-invariance bug in
Confidence Score: 5/5Safe to merge; the one-line default change and accompanying test correctly address the described silent failure. The change is minimal and targeted: a single default value is lowered and a focused regression test is added. The fix is correct — the new The default value in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["newton_schulz(x, eps)"] --> B{tp_group?}
B -- yes --> C["distributed_normalize_p2(x, eps, tp_group)\nx / sqrt(x_sq_sum).clamp_min(eps)"]
B -- no --> D["F.normalize(x, p=2, dim=(-2,-1), eps=eps)\nx / max(||x||_F, eps)"]
C --> E["Newton–Schulz iterations\n(singular values tuned for ~1)"]
D --> E
E --> F["Orthogonalized output X"]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A["newton_schulz(x, eps)"] --> B{tp_group?}
B -- yes --> C["distributed_normalize_p2(x, eps, tp_group)\nx / sqrt(x_sq_sum).clamp_min(eps)"]
B -- no --> D["F.normalize(x, p=2, dim=(-2,-1), eps=eps)\nx / max(||x||_F, eps)"]
C --> E["Newton–Schulz iterations\n(singular values tuned for ~1)"]
D --> E
E --> F["Orthogonalized output X"]
Reviews (3): Last reviewed commit: "address review: trim NOTE to a #229 refe..." | Re-trigger Greptile |
|
Background / motivation: surfaced while running ms-swift + Megatron-Core SFT of Qwen3.5-35B-A3B (GatedDeltaNet hybrid, |
@skyw is right that 1e-30 underflows in fp32 when squared (1e-30**2 = 1e-60). 1e-15 keeps eps**2 representable (1e-15**2 = 1e-30, a normal fp32 value) while staying far below any realistic ||x||_F, so small-norm inputs still normalize by their true norm instead of the guard. Within the 1e-12..1e-15 range you suggested; chose 1e-15 for maximum dynamic range below the floor. See NVIDIA-NeMo#229. Signed-off-by: yuchenwang3 <eang333cms@gmail.com>
skyw
left a comment
There was a problem hiding this comment.
eps change LGTM. rest is not necessary.
| # ``||x||_F`` (input is fp32). If it is too large, a small-norm input is divided by ``eps`` | ||
| # instead of its norm, so ``||X||_F = ||x||_F / eps << 1`` and the iteration (tuned for | ||
| # singular values ~1) cannot lift it -> silently degenerate, non-orthogonal output. This | ||
| # breaks the scale-invariance of orthogonalization. It must also stay above ~1e-15 so that |
There was a problem hiding this comment.
Statement is too strong. a reference to #229 should be sufficientl
| ) | ||
|
|
||
| @parameterized.parameters(1e-2, 1e-6, 1e-9, 1e-12) | ||
| def test_newtonschulz_scale_invariance(self, scale): |
There was a problem hiding this comment.
Test name is not quite right. I'd use more explicit name like test small eps etc.
|
/ok to test dcf8361 |
… to test_newtonschulz_small_eps Per @skyw review on NVIDIA-NeMo#230: keep the fp32-safe eps=1e-15 change, shorten the over-long NOTE to a brief NVIDIA-NeMo#229 reference, and give the regression test a more explicit name. Signed-off-by: Yuchen Wang <yw.yy953e@alibaba-inc.com>
|
Done in 620196c — both addressed:
eps=1e-15 kept as you OK'd. Ready to merge whenever the re-run is green. |
|
/ok to test 620196c |
What
newton_schulzis meant to be scale-invariant — the orthogonalization ofxandc*xshould be identical. It is not: the internalF.normalize(x, p=2, dim=(-2, -1), eps=1e-7)divides a small-norm input byepsinstead of its norm once||x||_F < eps, so||X||_F = ||x||_F / eps << 1and the Newton–Schulz iteration (tuned for singular values ~1) cannot lift it → a silently degenerate, non-orthogonal output (no warning, no error).Fix
Lower the
epsdefault1e-7 → 1e-30so it acts purely as a divide-by-zero guard, which is its documented purpose (eps: Small constant to avoid division by zero.). Input is enforcedfloat32, so1e-30is a safe guard. This covers all paths, since they share thiseps:torch.nn.functional.normalize(..., eps=eps)distributed_normalize_p2(x, eps, tp_group)(x / sqrt(sum).clamp_min(eps))newton_schulz_tproutes throughnewton_schulz, so it inherits the fix.Before / after (standalone, CPU, mirrors the
F.normalize(..., eps=1e-7)prelude)Test
Adds
test_newtonschulz_scale_invariance(asserts the orthogonalized output is identical forxandscale * x,scale ∈ {1e-2, 1e-6, 1e-9, 1e-12}). It fails on the oldeps=1e-7default and passes with the fix.Alternatives (happy to switch to maintainers' preference)
eps = torch.finfo(x.dtype).tiny(dtype-relative guard).||x||_F < epsinstead of changing the default (keeps current behavior but stops the silent failure).Note
Could not run the test suite locally (macOS arm64 has no
tritonwheel, so the package fails to import); relying on CI. The standalone repro above was verified on CPU.Why it matters
Combined with a training framework that applies gradient-norm clipping to the Muon param group (e.g. Megatron-LM
ChainedOptimizer), the clip coefficient can scale per-matrix gradients below this floor → Newton–Schulz silently emits degenerate updates → training stalls while the forward/loss look completely normal, which is very hard to diagnose.Fixes #229