Is your feature request related to a problem? Please describe.
The current implementation of optimizers does not generalize well to neural networks.
Reason: they are hard-coded to work with a single parameter at a time, and so are most of the loss functions.
Describe the solution you'd like
We should implement the pre-existing optimizers from scratch (not really scratch, though) in the folder optim by subclassing the base Optimizer, with their sole purpose being to update the parameters passed to them. Most (read: almost all) optimizers are already implemented inside optimizers.py, so they can be used as a reference for understanding how a core mathematical idea translates to code.
But what about gradients, then? autograd will handle them.
Describe alternatives you've considered
We could modify the loss functions to return the gradients for all parameters manually to the optimizer, but this would require further abstraction of the iterate(...) method of the optimizers, and modifying the loss functions in such a way would be so time- and resource-consuming that it might not be worth the effort, given that we already have automatic differentiation in place to compute those derivatives.
Approach to be followed
- Read the available implementations in the optim folder. Here's a description of what's been done in the case of SGD:
  - The __init__(...) function accepts a Python iterable containing the parameters to be updated, and the learning rate as lr.
  - The step(...) method iterates over all the parameters that were passed during instantiation of the optimizer and performs the update according to the rule P = P - learning_rate * dL/dP, where L is the final loss and P is a parameter.
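To make the description above concrete, here is a minimal sketch of what such an SGD subclass could look like. The `Tensor` and `Optimizer` classes below are simplified stand-ins for the library's real ones (assumed names and shapes, not the actual API):

```python
# Minimal sketch of the SGD described above. Tensor is a stub standing in
# for the library's autograd Tensor: .data holds the raw value, .grad holds
# another Tensor with dL/dP filled in by autograd during back-propagation.
class Tensor:
    def __init__(self, data, grad=None):
        self.data = data
        self.grad = grad

class Optimizer:
    """Base class: stores the parameters; subclasses implement step()."""
    def __init__(self, parameters):
        self.parameters = list(parameters)

    def step(self):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, parameters, lr=0.01):
        super().__init__(parameters)
        self.lr = lr

    def step(self):
        # P = P - learning_rate * dL/dP, applied to every parameter
        for param in self.parameters:
            param.data = param.data - self.lr * param.grad.data

p = Tensor(1.0, grad=Tensor(2.0))  # pretend autograd filled in dL/dP = 2.0
SGD([p], lr=0.1).step()
print(p.data)  # 1.0 - 0.1 * 2.0 = 0.8
```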
- Here's a list of (most of) the optimization algorithms used for neural networks: https://www.kdnuggets.com/2019/06/gradient-descent-algorithms-cheat-sheet.html. You can read the mathematical equations listed there, and if a more intuitive understanding is desired: YouTube!
- By now, it must be clear that not every optimizer is as simple as SGD. So how should they actually be implemented?
- A NOTE ABOUT THE NESTEROV ACCELERATED GRADIENT algorithm: most libraries implement this by storing the look-ahead weights, so that the update looks more like gradient descent; we can follow this route as well. If this didn't make sense: "Don't worry about it!"
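For what it's worth, here is one way the look-ahead formulation could be sketched. The update rule used below (v ← mu·v − lr·g; p ← p + mu·v − lr·g) is the common reformulation where the stored parameters are the look-ahead weights; the `Tensor` stub is a hypothetical stand-in, not the library's actual class:

```python
# Sketch of Nesterov accelerated gradient in its look-ahead form: because
# the stored parameters are the look-ahead weights, the update reads like
# plain gradient descent plus a momentum correction.
class Tensor:
    def __init__(self, data, grad=None):
        self.data = data
        self.grad = grad

class NesterovSGD:
    def __init__(self, parameters, lr=0.01, momentum=0.9):
        self.parameters = list(parameters)
        self.lr = lr
        self.momentum = momentum
        self.velocities = [0.0] * len(self.parameters)  # one velocity per parameter

    def step(self):
        for i, param in enumerate(self.parameters):
            g = param.grad.data  # gradient evaluated at the (stored) look-ahead weights
            # v <- mu * v - lr * g
            self.velocities[i] = self.momentum * self.velocities[i] - self.lr * g
            # p <- p + mu * v - lr * g  (gradient-descent-like update)
            param.data = param.data + self.momentum * self.velocities[i] - self.lr * g

p = Tensor(1.0, grad=Tensor(1.0))
NesterovSGD([p], lr=0.1, momentum=0.9).step()
print(p.data)  # ~0.81
```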
- The __init__(...) function:
  - Any hyperparameters an optimizer needs are accepted in __init__(...) and stored on the class instance as attributes, because we need them later to perform our update; similarly, for Adam we will be needing beta1, beta2 and epsilon.
  - State such as S_t and V_t is initialized in a similar fashion, to perform the desired update on the parameters.
- The step(...) function:
  - The update is performed directly on the data attribute of a Tensor. When we perform operations on Tensors, they get added to the computation graph or, to put it simply, these operations are recorded to be used during back-propagation. Since we are just updating the param, we don't want the update itself to be used during back-propagation, so we perform the operation directly on param.data and param.grad.data (param and param.grad are both Tensors).
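To tie the __init__/step split together, here is a hedged sketch of what Adam could look like under this scheme: beta1, beta2 and epsilon stored in __init__, per-parameter V (first moment) and S (second moment) state, and the update written straight to param.data. The `Tensor` stub is again just a stand-in for the real autograd Tensor:

```python
import math

# Sketch of Adam following the structure above: hyperparameters live on the
# instance, V and S are kept per parameter, and step() writes directly to
# param.data so the update is not recorded for back-propagation.
class Tensor:
    def __init__(self, data, grad=None):
        self.data = data
        self.grad = grad

class Adam:
    def __init__(self, parameters, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.parameters = list(parameters)
        self.lr = lr
        self.beta1, self.beta2, self.epsilon = beta1, beta2, epsilon
        self.V = [0.0] * len(self.parameters)  # moving average of gradients
        self.S = [0.0] * len(self.parameters)  # moving average of squared gradients
        self.t = 0                             # step counter for bias correction

    def step(self):
        self.t += 1
        for i, param in enumerate(self.parameters):
            g = param.grad.data
            self.V[i] = self.beta1 * self.V[i] + (1 - self.beta1) * g
            self.S[i] = self.beta2 * self.S[i] + (1 - self.beta2) * g * g
            # bias-corrected moment estimates
            v_hat = self.V[i] / (1 - self.beta1 ** self.t)
            s_hat = self.S[i] / (1 - self.beta2 ** self.t)
            # update the raw data so autograd does not record it
            param.data = param.data - self.lr * v_hat / (math.sqrt(s_hat) + self.epsilon)

p = Tensor(1.0, grad=Tensor(0.5))
Adam([p], lr=0.1).step()
print(p.data)  # first step moves by ~lr regardless of gradient scale: ~0.9
```

Note the design point the bullets make: all arithmetic goes through `param.data` and `param.grad.data`, never through the Tensors themselves, so the optimizer's bookkeeping never ends up in the computation graph.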
Additional context
Neural nets depend entirely on optimizers for learning; having various optimizers available at hand would be a great feature for the library.