
Locoprop-S and Newton-Muon support via TE, Megatron-LM changes. #215

Description

@plugyawn

Is your feature request related to a problem? Please describe.

Newton-Muon and Locoprop are part of an evolving line of optimizers that study right-preconditioning of the gradient. A strong limitation so far has been that they require activation information in the optimization step.
However,

  • Newton-Muon can be expressed as $msign(G f(C))$, where C is the feature gram matrix. (NM chooses $f(C)$ to be $C^{-1}$ but it could be something else, from what I understand).
  • Locoprop can also be expressed purely as a function of $(G,C)$ through the following construction:
Image

Hence, I propose adding optional support to route the _feature_gram_ ($X^T X$) alongside the main_grad ($dY^T X$), via a series of not-very-invasive changes to TransformerEngine, Megatron-LM, and Emerging-Optimizers. This opens up the possibility of a beautiful abstraction:
Image
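
To make the $(G, C)$ formulation concrete, here is a minimal NumPy sketch of the Newton-Muon direction $msign(G\,C^{-1})$. It is an illustration only, not the code in the forks: the function names, the `damping` term, and the SVD-based `msign` (rather than a polar iteration) are my assumptions, and the Locoprop construction is not reproduced here.

```python
import numpy as np

def msign(m):
    """Matrix sign (polar factor) U @ V^T, computed here via a thin SVD."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def newton_muon_direction(grad, gram, damping=1e-6):
    """Hypothetical helper: msign(G f(C)) with f(C) = (C + damping * I)^{-1}.

    The damping term is an assumption added for numerical stability.
    """
    d_in = gram.shape[0]
    # C is symmetric, so solving (C + damping I) Z^T = G^T gives Z = G (C + damping I)^{-1}
    # without forming the inverse explicitly.
    preconditioned = np.linalg.solve(gram + damping * np.eye(d_in), grad.T).T
    return msign(preconditioned)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))    # activations X: (tokens, d_in)
dy = rng.standard_normal((64, 4))   # output gradients dY: (tokens, d_out)

grad = dy.T @ x                     # G = dY^T X, shape (d_out, d_in)
gram = x.T @ x                      # C = X^T X, shape (d_in, d_in)

# The optimizer step only sees (G, C); X itself is no longer needed.
update = newton_muon_direction(grad, gram)
```

Once `grad` and `gram` are accumulated, the raw activations can be dropped, which is the property the routing change relies on.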

Describe the solution you'd like

I have a functioning version of the solution with TE, Megatron-LM, EO forks pinned at Megaprop.

On a TP=2 sweep on a Megatron GPT over a FineWeb Edu set, 2000 steps, two sweeps, with the feature gram matrix refreshed every 8 steps, we have:

Image

It's wasteful to materialize the full gram matrix, so with Locoprop I tried a diagonal approximation and a block diagonal approximation.
The diagonal approximation seems to do well! NM currently appears to be slower, possibly because of the polar iteration; I still need to check.

Image
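
For reference, a rough sketch of what the two approximations could look like, assuming they simply keep the diagonal entries or the diagonal blocks of $X^T X$ (the exact definition used in the forks may differ). The helper names and `block_size` are hypothetical.

```python
import numpy as np

def gram_diagonal(x):
    """diag(X^T X) without forming the full (d_in, d_in) matrix."""
    return np.einsum("ni,ni->i", x, x)

def gram_block_diagonal(x, block_size):
    """Diagonal blocks of X^T X, shape (num_blocks, block_size, block_size)."""
    n, d_in = x.shape
    if d_in % block_size != 0:
        raise ValueError("d_in must be divisible by block_size")
    # Split the feature axis into contiguous blocks and take a per-block gram.
    blocks = x.reshape(n, d_in // block_size, block_size)
    return np.einsum("nkb,nkc->kbc", blocks, blocks)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 12))             # activations X: (tokens, d_in)

diag = gram_diagonal(x)                        # d_in numbers
blocks = gram_block_diagonal(x, block_size=4)  # 3 blocks of 4 x 4
```

Storage drops from `d_in * d_in` entries to `d_in` for the diagonal and `d_in * block_size` for the block-diagonal variant.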

Locoprop was initially slow too, but Codex was able to write some basic kernels very quickly.

I think a few more AdamW LRs should be checked, but the initial results look promising, and not streaming the activations seems to work. I double-checked that the calculations come out equivalent.

I've also attached a design doc here for reference.
The diff excl. test files is not that huge:

repo                 | non-test diff
---------------------|-----------------------
Megatron-LM          | 13 files, +1336/-2
Emerging-Optimizers  | 6 files, +1140/-3
TransformerEngine    | 5 files, +281/-0
Total                | 24 files, +2757/-5

CC and thanks to: @mkhona-nvidia for his help!

feature_gram_matrix_optimizers_design.pdf

