abstract: "What distinguishes modern adaptive methods from gradient descent in favoring better-generalizing solutions? To study this question for steepest-descent methods, including sign descent (an optimizer closely related to Adam), we introduce steepest mirror flows as a unifying theoretical framework. This framework lets us analyze how optimization geometry governs learning dynamics, implicit bias, and sparsity. It also suggests a mechanism that may help explain why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations, we show that steeper descent promotes saddle-point escape. By contrast, gradient descent typically requires much larger learning rates to escape saddles, and such regimes are less common in fine-tuning practice. Furthermore, we find that decoupled weight decay, as in AdamW, stabilizes sparse training by enforcing novel balance equations. Experiments confirm that our theoretical insights and hypothesized mechanisms transfer to realistic settings. Together, these results identify two mechanisms through which steepest descent can benefit modern optimization: saddle escape and sparsity."
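To make the saddle-escape mechanism concrete, below is a minimal sketch (not the paper's code; the data, dimensions, initialization, and step sizes are illustrative assumptions). It compares gradient descent with sign descent, i.e., steepest descent in the l-infinity norm, on a two-layer diagonal linear network f(x) = <u * v, x>. Near the saddle at the origin, gradient steps are scaled by the tiny parameter magnitudes and escape slowly, while sign descent takes fixed-size steps, escapes quickly, and recovers a sparse solution.

```python
# Toy comparison of gradient descent vs. sign descent on a diagonal linear
# network (illustrative sketch only; all values below are assumptions).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = [2.0, -1.5, 1.0]           # sparse ground truth
y = X @ w_star

def grads(u, v):
    """Gradients of L(u, v) = ||X(u*v) - y||^2 / (2n) via the chain rule."""
    g = X.T @ (X @ (u * v) - y) / n     # gradient w.r.t. effective weights w = u*v
    return v * g, u * g                 # reparameterization scales the gradient

for name, step in [("gradient descent", lambda g: g), ("sign descent", np.sign)]:
    u = np.full(d, 1e-3)                # small init: near the saddle at the origin
    v = np.full(d, 1e-3)
    lr = 1e-2
    for _ in range(500):
        gu, gv = grads(u, v)
        u, v = u - lr * step(gu), v - lr * step(gv)
    w = u * v
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    print(f"{name:16s} loss={loss:.4f}  coords with |w|>0.1: {(np.abs(w) > 0.1).sum()}")
```

Running this, sign descent fits the three active coordinates while leaving the rest near zero, whereas gradient descent at the same learning rate barely moves away from the saddle, matching the abstract's claim that gradient descent needs much larger learning rates to escape.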