Locoprop-S and Newton-Muon support via TE, Megatron-LM changes.

## **Is your feature request related to a problem? Please describe.**
Newton-Muon and Locoprop are part of an evolving line of optimizers that study right-preconditioning of the gradient. A strong limitation has been that they require activation information in the optimization step.
However,
- Newton-Muon can be expressed as $\mathrm{msign}(G\,f(C))$, where $C$ is the feature gram matrix. (NM chooses $f(C) = C^{-1}$, but as I understand it, other choices are possible.)
- Locoprop can also be expressed purely as a function of $(G,C)$ through the following construction:

<img width="611" height="585" alt="Image" src="https://github.com/user-attachments/assets/81aac8a8-7711-4595-861d-78aa8e55cb5c" />
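
As a rough illustration of the Newton-Muon bullet above, here is a minimal NumPy sketch that computes $\mathrm{msign}(G\,C^{-1})$ from $(G, C)$ alone. Everything named here is an assumption for illustration: `msign` uses an exact SVD polar factor in place of whatever polar iteration the real implementation runs, and the `damping` term is added only for numerical stability. It is not the code from the forks.

```python
import numpy as np

def msign(M):
    """Matrix sign (polar factor) of M via a thin SVD: U @ Vt.
    An exact stand-in for an iterative polar method."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_muon_direction(G, C, damping=1e-6):
    """Return msign(G @ inv(C + damping * I)) using only (G, C).
    `damping` is an assumption of this sketch, not taken from the issue."""
    C_damped = C + damping * np.eye(C.shape[0])
    # C is symmetric, so G @ inv(C) equals solve(C, G.T).T
    right_preconditioned = np.linalg.solve(C_damped, G.T).T
    return msign(right_preconditioned)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 6))    # layer inputs: (tokens, in_features)
dY = rng.standard_normal((32, 4))   # output grads: (tokens, out_features)
G = dY.T @ X                        # weight gradient, shape (out, in)
C = X.T @ X                         # feature gram, shape (in, in)
D = newton_muon_direction(G, C)     # update direction, shape (out, in)
```

Once `G` and `C` are formed, `X` and `dY` are no longer needed, which is the property the proposal relies on.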

Hence, I propose adding optional support to route the `_feature_gram_` ($X^T X$) *beside* the `main_grad` ($dY^T X$), via a series of not-very-invasive changes to TransformerEngine, Megatron-LM, and Emerging-Optimizers. This opens up the possibility of a beautiful abstraction:
<img width="539" height="166" alt="Image" src="https://github.com/user-attachments/assets/7304e3ac-00d3-495e-917e-e7efa9fdcb59" />
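
To make the routing idea concrete, here is a small sketch of a gram buffer accumulated next to the gradient buffer. The class `LinearGradBuffers`, its `accumulate` method, and the `with_gram` flag are hypothetical names invented for this example; they are not the TransformerEngine or Megatron-LM API. The point is only that $X^T X$ is additive over microbatches in the same way $dY^T X$ is, so it can be accumulated in place without keeping activations around.

```python
import numpy as np

class LinearGradBuffers:
    """Hypothetical per-weight buffers: a gradient plus an optional feature gram."""

    def __init__(self, out_features, in_features):
        self.main_grad = np.zeros((out_features, in_features))
        self.feature_gram = np.zeros((in_features, in_features))

    def accumulate(self, x, dy, with_gram=True):
        # Weight gradient contribution of this microbatch: dY^T X
        self.main_grad += dy.T @ x
        if with_gram:
            # Feature gram contribution of this microbatch: X^T X
            self.feature_gram += x.T @ x

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 5))    # (tokens, in_features)
dY = rng.standard_normal((16, 3))   # (tokens, out_features)

buffers = LinearGradBuffers(out_features=3, in_features=5)
# Feed the batch as four microbatches of four tokens each.
for x_mb, dy_mb in zip(np.split(X, 4), np.split(dY, 4)):
    buffers.accumulate(x_mb, dy_mb)
```

Passing `with_gram=False` on most steps would be one way to refresh the gram only periodically, though how the forks schedule the refresh is not shown here.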

## **Describe the solution you'd like**
I have a functioning version of the solution, with the TE, Megatron-LM, and Emerging-Optimizers (EO) forks pinned at [Megaprop](https://github.com/plugyawn/megaprop).

On a TP=2 sweep of a Megatron GPT over a FineWeb Edu set (2000 steps, two sweeps), with the feature gram matrix refreshed every 8 steps, we have:

<img width="3252" height="1942" alt="Image" src="https://github.com/user-attachments/assets/5f49d908-ad78-47ef-bf85-7aa369257647" />

It's wasteful to materialize the full gram matrix, so with Locoprop I tried a diagonal approximation and a block-diagonal approximation.
The diagonal approximation seems to do well! NM appears to be slower at the moment, possibly because of the polar iteration; I still need to check.

<img width="3080" height="1928" alt="Image" src="https://github.com/user-attachments/assets/0212eeed-3dea-403c-b7c5-63820cbe5658" />
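
For reference, a sketch of what the two approximations store, assuming they simply keep the diagonal or the diagonal blocks of $X^T X$. The helper names `diag_gram` and `block_diag_gram` and the `block_size` parameter are made up for this example, and how the optimizer then consumes these pieces is not covered.

```python
import numpy as np

def diag_gram(X):
    """Diagonal of X^T X, computed without forming the full matrix."""
    return np.einsum("ni,ni->i", X, X)

def block_diag_gram(X, block_size):
    """Diagonal blocks of X^T X, shape (num_blocks, block_size, block_size)."""
    n, d = X.shape
    if d % block_size != 0:
        raise ValueError("in_features must be divisible by block_size")
    # Group the feature dimension into contiguous blocks.
    Xb = X.reshape(n, d // block_size, block_size)
    return np.einsum("nbi,nbj->bij", Xb, Xb)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))     # (tokens, in_features)
C = X.T @ X                          # full gram, only built here for comparison

d = diag_gram(X)                     # 8 numbers instead of 64
blocks = block_diag_gram(X, 4)       # two 4x4 blocks instead of one 8x8
```

For `in_features = d`, the diagonal variant stores `d` values and the block variant stores `d * block_size`, versus `d * d` for the full gram.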

Locoprop's speed was also off initially, but Codex was able to write some basic kernels very quickly.

I think a few more AdamW LRs should be checked, but the initial results look promising, and not streaming the activations seems to work. I double-checked that the calculations come out equivalent.

I've also attached a design doc here for reference. 
The diff, excluding test files, is not that large:
```
repo                 | non-test diff
---------------------|-----------------------
Megatron-LM          | 13 files, +1336/-2
Emerging-Optimizers  | 6 files, +1140/-3
TransformerEngine    | 5 files, +281/-0
Total                | 24 files, +2757/-5
```

CC and thanks to: @mkhona-nvidia for his help!

[feature_gram_matrix_optimizers_design.pdf](https://github.com/user-attachments/files/28427984/feature_gram_matrix_optimizers_design.1.pdf)




