MOSO: Combining SOAP and Muon #222
Signed-off-by: mikail <mkhona@nvidia.com>
Greptile Summary
This PR introduces MOSO (Momentum One-Sided SOAP), a new optimizer that combines Muon-style EMA momentum with one-sided SOAP preconditioning. It maintains an Adam update in the covariance eigenbasis of the Muon momentum matrix, restricted to the smaller of the two matrix dimensions.
Confidence Score: 4/5
Safe to merge after addressing the shampoo_beta=1.0 NaN issue; all normal-range hyperparameter values work correctly. The optimizer correctly implements all described algorithmic pieces and has matching tests. The one concrete defect is in the shampoo_beta bias-correction formula: passing shampoo_beta=1.0 produces a 0/0 NaN that silently corrupts state["M"] and all downstream state for the rest of training.
See emerging_optimizers/soap/moso.py (the shampoo_beta bias-correction formula at line 156).
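The formula at line 156 is not reproduced in this summary. As a minimal sketch, assuming it has the usual bias-corrected EMA form (1 - beta) / (1 - beta**step), the snippet below shows where the 0/0 comes from and how a range check could reject the bad value up front. The function name `bias_corrected_ema_weight` is hypothetical and is not part of the PR.

```python
def bias_corrected_ema_weight(beta: float, step: int) -> float:
    """Weight on the newest sample of a bias-corrected EMA.

    Assumed form for illustration only; MOSO's actual formula may differ.
    """
    if not 0.0 <= beta < 1.0:
        # With beta == 1.0 both (1 - beta) and (1 - beta**step) are zero,
        # which is the 0/0 case described in the review.
        raise ValueError(f"beta must be in [0, 1), got {beta}")
    if step < 1:
        raise ValueError(f"step must be >= 1, got {step}")
    return (1.0 - beta) / (1.0 - beta**step)
```

Under this assumption the weight is 1 on the first step and decays toward 1 - beta as the step count grows, so rejecting beta = 1.0 does not affect any value in the normal range.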
Sequence Diagram
sequenceDiagram
participant Caller
participant MOSO
participant MomentumFactor as _update_one_sided_momentum_factor
participant EigUpdate as _update_eigenbasis_and_adam_exp_avgs
participant Adam as calculate_adam_update
Caller->>MOSO: step()
MOSO->>MOSO: apply weight decay (decoupled)
MOSO->>MOSO: "lerp momentum_buffer <- grad (Muon EMA)"
MOSO->>MomentumFactor: update M (one-sided covariance of momentum)
MomentumFactor-->>MOSO: M updated in-place
MOSO->>EigUpdate: rotate exp_avg to new basis, permute exp_avg_sq, update Q_M
Note over EigUpdate: step==0 uses eigh(M), step>0 uses orthogonal iteration
EigUpdate-->>MOSO: (Q_M, exp_avg_in_new_basis, permuted_exp_avg_sq)
MOSO->>MOSO: project momentum into Q_M basis
MOSO->>Adam: calculate_adam_update(projected_momentum, exp_avg, exp_avg_sq)
Adam-->>MOSO: adam_update (in Q_M basis)
MOSO->>MOSO: project adam_update back to parameter space
MOSO->>MOSO: clip RMS (optional)
MOSO->>Caller: "p <- p - lr * update"
Reviews (7): Last reviewed commit: "Merge branch 'main' into mkhona/shmuon"
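The note in the diagram (eigh on the first step, orthogonal iteration afterwards) can be sketched as below. This is an illustrative NumPy version, not the PR's PyTorch code: the function names `init_eigenbasis` and `refine_eigenbasis` are hypothetical, `factor` stands for the one-sided covariance factor, and the descending eigenvalue order is an assumption.

```python
import numpy as np


def init_eigenbasis(factor: np.ndarray) -> np.ndarray:
    """Exact eigenbasis of a symmetric factor, used once at step 0."""
    # eigh returns eigenvalues in ascending order; flip the columns so the
    # leading eigenvector comes first (ordering is an assumption here).
    _, q = np.linalg.eigh(factor)
    return q[:, ::-1]


def refine_eigenbasis(factor: np.ndarray, q_prev: np.ndarray) -> np.ndarray:
    """One step of orthogonal iteration: Q <- qr(factor @ Q_prev)."""
    # Multiplying by the factor pulls the columns toward its dominant
    # eigenvectors; QR re-orthonormalizes them.
    q_new, _ = np.linalg.qr(factor @ q_prev)
    return q_new
```

One QR per step is cheaper than a full eigendecomposition and tracks the eigenbasis well when the factor changes slowly between steps, which is the usual motivation for this pattern in SOAP-style optimizers.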
skyw left a comment:
Other than a bit of DRY, which is not this PR's problem, mostly OK. Will approve after.
/ok to test 1646e65
MOSO
MOSO, short for Momentum One-Sided SOAP, combines Muon-style momentum with SOAP's eigenbasis Adam update, but keeps the preconditioner one-sided, on the smaller matrix dimension. For a momentum matrix $M_t$, MOSO accumulates a SOAP-style covariance over momentum instead of raw gradients, using $C_t = \beta_s C_{t-1} + (1 - \beta_s) M_t M_t^T$ for the left-preconditioned case, or $C_t = \beta_s C_{t-1} + (1 - \beta_s) M_t^T M_t$ for the right-preconditioned case. With $C_t = Q_M \Lambda_M Q_M^T$, the left-side update is

$$U_t = Q_M \, \text{Adam}(Q_M^T M_t),$$

with the analogous right-side projection $U_t = \text{Adam}(M_t Q_M) Q_M^T$ when the column dimension is smaller. This can be read as one-sided SOAP on Muon momentum: rotate $M_t$ into the momentum-covariance eigenbasis, run the inner Adam update there, and rotate back.
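The rotate, Adam, rotate-back recipe can be sketched for the very first step, starting from zero state. This is a NumPy illustration, not the PR's implementation: the function name `moso_first_step_sketch`, the default hyperparameters, and the bias-corrected form of the inner Adam are assumptions. Later steps would also need to carry the covariance and Adam state and rotate `exp_avg` whenever the basis changes, which is omitted here.

```python
import numpy as np


def moso_first_step_sketch(momentum, beta_s=0.95, beta1=0.9, beta2=0.95, eps=1e-8):
    """First MOSO-style update for a 2-D momentum matrix (illustrative only)."""
    rows, cols = momentum.shape
    left = rows <= cols  # precondition on the smaller dimension

    # One-sided covariance of the momentum, EMA'd from a zero initial state.
    outer = momentum @ momentum.T if left else momentum.T @ momentum
    cov = (1.0 - beta_s) * outer

    # Eigenbasis Q_M of the symmetric covariance.
    _, q = np.linalg.eigh(cov)

    # Rotate the momentum into the eigenbasis.
    projected = q.T @ momentum if left else momentum @ q

    # Inner Adam at step 1 from zero state, with bias correction (assumed).
    exp_avg = (1.0 - beta1) * projected
    exp_avg_sq = (1.0 - beta2) * projected**2
    m_hat = exp_avg / (1.0 - beta1)
    v_hat = exp_avg_sq / (1.0 - beta2)
    adam = m_hat / (np.sqrt(v_hat) + eps)

    # Rotate the Adam update back to parameter space.
    return q @ adam if left else adam @ q.T
```

At step 1 the bias-corrected Adam output is approximately the elementwise sign of the projected momentum, so under these assumptions the first update reduces to $Q_M \, \text{sign}(Q_M^T M_t)$ in the left-side case.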