MOSO: Combining SOAP and Muon #222

Merged
mkhona-nvidia merged 9 commits into NVIDIA-NeMo:main from mkhona-nvidia:mkhona/shmuon
Jun 4, 2026

@mkhona-nvidia

@mkhona-nvidia mkhona-nvidia commented Jun 3, 2026

Contributor

MOSO

MOSO, short for Momentum One-Sided SOAP, combines Muon-style momentum with SOAP's eigenbasis Adam update, but keeps the preconditioner one-sided on the smaller matrix dimension. For a momentum matrix $M_t$, MOSO accumulates a SOAP-style covariance over momentum instead of raw gradients, using $C_t = \beta_s C_{t-1} + (1 - \beta_s) M_t M_t^T$ for the left-preconditioned case, or $C_t = \beta_s C_{t-1} + (1 - \beta_s) M_t^T M_t$ for the right-preconditioned case. With the eigendecomposition $C_t = Q_M \Lambda_M Q_M^T$, the left-side update is

$$U_t = Q_M \text{Adam}(Q_M^T M_t),$$

with the analogous right-side projection $U_t = \text{Adam}(M_t Q_M) Q_M^T$ when the column dimension is smaller. This can be read as one-sided SOAP on Muon momentum: rotate $M_t$ into the momentum-covariance eigenbasis, run the inner Adam update there, and rotate back.
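The left-side update above can be sketched in NumPy as follows. This is an illustrative sketch, not the code in moso.py: the function name `moso_left_update`, its argument names, and the default hyperparameters are assumptions, and it recomputes a full `eigh` every step without carrying the Adam state across basis changes.

```python
import numpy as np

def moso_left_update(M, C, exp_avg, exp_avg_sq, step,
                     beta_s=0.95, beta1=0.9, beta2=0.95, eps=1e-8):
    """One illustrative left-preconditioned step (rows <= columns).

    All names and defaults here are hypothetical, chosen for the sketch.
    """
    # C_t = beta_s * C_{t-1} + (1 - beta_s) * M_t M_t^T
    C = beta_s * C + (1.0 - beta_s) * (M @ M.T)
    # Eigenbasis Q_M of the momentum covariance (C is symmetric PSD).
    _, Q = np.linalg.eigh(C)
    # Rotate the momentum into the eigenbasis: Q_M^T M_t
    P = Q.T @ M
    # Standard bias-corrected Adam moments in the rotated space.
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * P
    exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * P**2
    m_hat = exp_avg / (1.0 - beta1**step)
    v_hat = exp_avg_sq / (1.0 - beta2**step)
    # Rotate the Adam direction back: U_t = Q_M Adam(Q_M^T M_t)
    U = Q @ (m_hat / (np.sqrt(v_hat) + eps))
    return U, C, exp_avg, exp_avg_sq
```

The right-side case would mirror this with `M.T @ M` as the covariance, `M @ Q` as the projection, and `... @ Q.T` as the back-rotation.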

Signed-off-by: mikail <mkhona@nvidia.com>
Signed-off-by: mikail <mkhona@nvidia.com>
@mkhona-nvidia mkhona-nvidia requested a review from skyw June 3, 2026 19:28
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Greptile Summary

This PR introduces MOSO (Momentum One-Sided SOAP), a new optimizer that combines Muon-style EMA momentum with one-sided SOAP preconditioning. It maintains an Adam update in the covariance eigenbasis of the Muon momentum matrix, restricted to the smaller of the two matrix dimensions.

  • moso.py implements the full MOSO optimizer: Muon momentum EMA, bias-corrected one-sided Shampoo covariance, eigenbasis update (full eigh at step 0, orthogonal iteration thereafter), Adam-state basis-change rotation for exp_avg, and permutation of exp_avg_sq on eigenvalue sort.
  • tests/test_moso.py covers smoke steps for multiple shapes, registry lookup, covariance accumulation on the smaller side, and a closed-form equivalence check for the no-EMA case.
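The eigenbasis refresh described in the first bullet (sort by estimated eigenvalue, one orthogonal-iteration step, rotate exp_avg, permute exp_avg_sq) can be sketched for the left-preconditioned case as below. The helper name `refresh_eigenbasis` is hypothetical and the ordering of operations is an assumption modeled on SOAP-style QR refreshes; it is not a copy of `_update_eigenbasis_and_adam_exp_avgs`.

```python
import numpy as np

def refresh_eigenbasis(C, Q_old, exp_avg, exp_avg_sq):
    """Illustrative one-step orthogonal iteration with Adam-state fix-up.

    exp_avg and exp_avg_sq live in the Q_old basis; their rows index
    eigen-directions (left-preconditioned case).
    """
    # Estimated eigenvalues in the current basis: diag(Q^T C Q).
    est = np.einsum("ij,ij->j", Q_old, C @ Q_old)
    order = np.argsort(-est)          # descending
    Q_sorted = Q_old[:, order]
    # The second moment is elementwise, so it is permuted, not rotated.
    exp_avg_sq = exp_avg_sq[order, :]
    # One orthogonal-iteration step: orthonormalize C @ Q.
    Q_new, _ = np.linalg.qr(C @ Q_sorted)
    # The first moment is linear, so it is rotated: old basis ->
    # parameter space -> new basis.
    exp_avg = Q_new.T @ (Q_old @ exp_avg)
    return Q_new, exp_avg, exp_avg_sq
```

Rotating exp_avg keeps the first moment pointing at the same parameter-space direction after the basis changes, while permuting exp_avg_sq only keeps each row attached to its (re-sorted) eigen-direction.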

Confidence Score: 4/5

Safe to merge after addressing the shampoo_beta=1.0 NaN issue; all normal-range hyperparameter values work correctly.

The optimizer correctly implements all described algorithmic pieces and has matching tests. The one concrete defect is in the shampoo_beta bias-correction formula: passing shampoo_beta=1.0 produces a 0/0 NaN that silently corrupts state["M"] and all downstream state for the rest of training.

emerging_optimizers/soap/moso.py (the shampoo_beta bias-correction formula at line 156)
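To illustrate how a 0/0 can arise, here is a minimal sketch that assumes the standard bias-corrected EMA form, $\hat{C}_t = C_t / (1 - \beta_s^t)$. The exact expression at line 156 of moso.py is not shown in this thread, so both `bias_corrected_ema` and the guard `check_shampoo_beta` are hypothetical.

```python
import numpy as np

def bias_corrected_ema(C_prev, X, beta, step):
    """EMA of X with the usual 1 / (1 - beta**step) bias correction."""
    C = beta * C_prev + (1.0 - beta) * X
    # Silence NumPy's warning so the NaN propagates quietly, as it
    # would inside a training loop.
    with np.errstate(invalid="ignore", divide="ignore"):
        C_hat = C / (1.0 - beta**step)
    return C, C_hat

def check_shampoo_beta(beta):
    """Hypothetical guard: reject values that zero the denominator."""
    if not 0.0 <= beta < 1.0:
        raise ValueError(f"shampoo_beta must be in [0, 1), got {beta}")
    return beta
```

With `beta=1.0` and a zero-initialized accumulator, the numerator stays at zero and the denominator `1 - 1**step` is also zero, so every entry of `C_hat` becomes NaN. A range check at construction time is one possible fix; whether the PR adopts it is not stated here.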

Important Files Changed

| Filename | Overview |
| --- | --- |
| emerging_optimizers/soap/moso.py | Core MOSO optimizer; correctly structured but the shampoo_beta bias-correction formula silently produces NaN when shampoo_beta=1.0. |
| tests/test_moso.py | Tests cover smoke, registry, covariance shape, and closed-form equivalence. |
| emerging_optimizers/soap/\_\_init\_\_.py | Trivial registration of MOSO in the soap subpackage. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant MOSO
    participant MomentumFactor as _update_one_sided_momentum_factor
    participant EigUpdate as _update_eigenbasis_and_adam_exp_avgs
    participant Adam as calculate_adam_update

    Caller->>MOSO: step()
    MOSO->>MOSO: apply weight decay (decoupled)
    MOSO->>MOSO: "lerp momentum_buffer <- grad (Muon EMA)"
    MOSO->>MomentumFactor: update M (one-sided covariance of momentum)
    MomentumFactor-->>MOSO: M updated in-place
    MOSO->>EigUpdate: rotate exp_avg to new basis, permute exp_avg_sq, update Q_M
    Note over EigUpdate: step==0 uses eigh(M), step>0 uses orthogonal iteration
    EigUpdate-->>MOSO: (Q_M, exp_avg_in_new_basis, permuted_exp_avg_sq)
    MOSO->>MOSO: project momentum into Q_M basis
    MOSO->>Adam: calculate_adam_update(projected_momentum, exp_avg, exp_avg_sq)
    Adam-->>MOSO: adam_update (in Q_M basis)
    MOSO->>MOSO: project adam_update back to parameter space
    MOSO->>MOSO: clip RMS (optional)
    MOSO->>Caller: "p <- p - lr * update"
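
The outer part of the sequence above (weight decay, momentum lerp, optional RMS clip, parameter update) can be sketched as below, with the eigenbasis Adam stage passed in as a callable. Everything here is an assumption for illustration: the name `outer_step`, the AdamW-style form of decoupled weight decay, and the RMS-clipping rule are not taken from moso.py.

```python
import numpy as np

def outer_step(p, grad, momentum, precondition,
               lr=1e-3, beta=0.95, weight_decay=0.0, max_rms=None):
    """Illustrative outer loop body; `precondition` stands in for the
    project -> Adam -> project-back stage of the diagram."""
    # Decoupled weight decay (assumed AdamW-style: shrink p directly).
    p = p * (1.0 - lr * weight_decay)
    # Muon-style EMA written as a lerp toward the gradient.
    momentum = momentum + (1.0 - beta) * (grad - momentum)
    update = precondition(momentum)
    # Optional RMS clip (assumed rule: rescale so RMS <= max_rms).
    if max_rms is not None:
        rms = np.sqrt(np.mean(update**2))
        if rms > max_rms:
            update = update * (max_rms / rms)
    # p <- p - lr * update
    p = p - lr * update
    return p, momentum
```

Passing `precondition=lambda m: m` reduces this to plain EMA-momentum SGD, which is a convenient baseline when checking the surrounding plumbing.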

Reviews (7): Last reviewed commit: "Merge branch 'main' into mkhona/shmuon"

Comment thread emerging_optimizers/soap/sh_muon.py Outdated
Signed-off-by: mikail <mkhona@nvidia.com>
@mkhona-nvidia mkhona-nvidia changed the title from "ShMuon: Combining SOAP and Muon" to "MOSO: Combining SOAP and Muon" Jun 3, 2026
Signed-off-by: mikail <mkhona@nvidia.com>

@skyw skyw left a comment

Contributor


Other than being a bit un-DRY, which is not this PR's problem, this is mostly OK. Will approve after.

Comment thread emerging_optimizers/soap/moso.py
Comment thread emerging_optimizers/soap/moso.py Outdated
Comment thread emerging_optimizers/soap/moso.py Outdated
Comment thread emerging_optimizers/soap/sh_muon.py Outdated
Comment thread tests/test_sh_muon.py Outdated
Signed-off-by: mikail <mkhona@nvidia.com>
Signed-off-by: mikail <mkhona@nvidia.com>
Signed-off-by: mikail <mkhona@nvidia.com>
Signed-off-by: mikail <mkhona@nvidia.com>
@mkhona-nvidia mkhona-nvidia enabled auto-merge (squash) June 4, 2026 21:36
@mkhona-nvidia

Contributor Author

/ok to test 1646e65

Comment thread emerging_optimizers/soap/moso.py
@mkhona-nvidia mkhona-nvidia merged commit a0e376b into NVIDIA-NeMo:main Jun 4, 2026
24 checks passed