Is your feature request related to a problem? Please describe.
Newton-Muon and Locoprop are part of an evolving line of optimizers that study right-preconditioning of the gradient. A strong limitation so far has been that they require activation information in the optimization step.
However,
- Newton-Muon can be expressed as $\mathrm{msign}(G f(C))$, where $C$ is the feature gram matrix. (NM chooses $f(C) = C^{-1}$, but from what I understand it could be something else.)
- Locoprop can also be expressed purely as a function of $(G,C)$ through the following construction:
Hence, I propose adding optional support to route the _feature_gram_ ($X^T X$) alongside the main_grad ($dY^T X$), via a series of not-very-invasive changes to TransformerEngine, Megatron-LM, and Emerging-Optimizers. This opens up the possibility of a beautiful abstraction:
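To make the $(G, C)$ interface concrete, here is a minimal NumPy sketch of the Newton-Muon form $\mathrm{msign}(G f(C))$ with $f(C) = C^{-1}$. It is an illustration only, not the code in the forks: the function names, the `damping` term, and the use of an exact SVD in place of a polar iteration are all assumptions made for this example.

```python
import numpy as np

def msign(M):
    """Polar factor of M via SVD: U @ Vt for M = U diag(s) Vt.
    (A stand-in for the polar iteration a real optimizer would use.)"""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, C, damping=1e-6):
    """Compute msign(G C^{-1}) from the gradient G and feature gram C only.
    `damping` is a hypothetical regularizer added for this sketch."""
    C_reg = C + damping * np.eye(C.shape[0])
    # C_reg is symmetric, so solving C_reg Z^T = G^T gives Z = G C_reg^{-1}
    # without forming the inverse explicitly.
    right_preconditioned = np.linalg.solve(C_reg, G.T).T
    return msign(right_preconditioned)

rng = np.random.default_rng(0)
tokens, d_in, d_out = 64, 8, 4
X = rng.standard_normal((tokens, d_in))    # layer input activations
dY = rng.standard_normal((tokens, d_out))  # gradient w.r.t. layer output

G = dY.T @ X   # main_grad, shape (d_out, d_in)
C = X.T @ X    # feature_gram, shape (d_in, d_in)

# From here on the activations X are no longer needed.
update = newton_muon_direction(G, C)
```

The point of the sketch is that once `G` and `C` have been accumulated, the step itself never touches `X`, which is what lets the optimizer run without activation info.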

Describe the solution you'd like
I have a functioning version of the solution with TE, Megatron-LM, EO forks pinned at Megaprop.
For a TP=2 Megatron GPT trained on a FineWeb Edu set for 2000 steps (two sweeps), with the feature gram matrix refreshed every 8 steps, we have:
It's wasteful to materialize the full gram matrix, so with Locoprop I tried a diagonal approximation and a block diagonal approximation.
The diagonal approximation seems to do well! NM appears to be slower at the moment, possibly because of the polar iteration; I need to check.
Locoprop was also slow initially, but Codex was able to write some basic kernels very quickly.
I think a few more AdamW LRs should be checked, but the initial results look promising, and not streaming the activations seems to work. I double-checked that the calculations come out to be equivalent.
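For reference, the diagonal and block-diagonal approximations of $X^T X$ mentioned above can be computed without materializing the full gram matrix. The sketch below is a hedged illustration in NumPy; the helper names and the `block_size` parameter are hypothetical and not taken from the forks, and how Locoprop consumes these quantities is not shown.

```python
import numpy as np

def gram_diagonal(X):
    """diag(X^T X) as per-feature sums of squares, shape (d_in,)."""
    return np.einsum("ti,ti->i", X, X)

def gram_block_diagonal(X, block_size):
    """Diagonal blocks of X^T X, shape (d_in // block_size, block_size, block_size).
    Assumes d_in is divisible by block_size."""
    tokens, d_in = X.shape
    assert d_in % block_size == 0
    # Group features into contiguous blocks: feature index = b * block_size + i.
    Xb = X.reshape(tokens, d_in // block_size, block_size)
    return np.einsum("tbi,tbj->bij", Xb, Xb)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))

diag = gram_diagonal(X)                        # d_in numbers
blocks = gram_block_diagonal(X, block_size=4)  # d_in * block_size numbers

# Full gram, built here only to check the approximations against it.
C = X.T @ X
```

Storage drops from `d_in**2` entries for the full gram to `d_in` for the diagonal and `d_in * block_size` for the block-diagonal variant.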
I've also attached a design doc here for reference.
The diff excluding test files is not that large:
repo | non-test diff
---------------------|-----------------------
Megatron-LM | 13 files, +1336/-2
Emerging-Optimizers | 6 files, +1140/-3
TransformerEngine | 5 files, +281/-0
Total | 24 files, +2757/-5
CC and thanks to: @mkhona-nvidia for his help!
feature_gram_matrix_optimizers_design.pdf