Is your feature request related to a problem? Please describe.
The current implementation of optimizers does not generalize well to neural networks.
Reason: they are hard-coded to work with a single parameter at a time, and so are most of the loss functions.
Describe the solution you'd like
We should implement the pre-existing optimizers from scratch (not really scratch, though) in the folder optim by subclassing the base Optimizer, with their sole purpose being to update the parameters passed to them. Most (read: almost all) optimizers are already implemented inside optimizers.py, so they can be used as a reference for understanding how a core mathematical idea translates to code.
But what about gradients, then? autograd will handle them.
Describe alternatives you've considered
We could modify the loss functions to return the gradients for all parameters manually to the optimizer, but this would require further abstraction of the iterate(...) method of the optimizers, and modifying the loss functions in such a way would be so time- and resource-consuming that it might not be worth the effort, given that we already have automatic differentiation in place to compute those derivatives.
Approach to be followed
- Read the available implementations in the optim folder. Here's a description of what's been done in the case of SGD:
  - The __init__(...) function accepts a Python iterable containing the parameters to be updated, and the learning rate as lr.
  - The step(...) method iterates over all the parameters that were passed during instantiation of the optimizer and performs the update according to the rule P = P - learning_rate * dL/dP, where L is the final loss and P is a parameter.
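To make the description above concrete, here is a minimal sketch of what such an SGD subclass could look like. The `Tensor` and `Optimizer` classes below are simplified stand-ins for the library's real ones (assumed names and shapes, not the actual API):

```python
# Minimal sketch of the SGD described above. Tensor is a stub standing in
# for the library's autograd Tensor: .data holds the raw value, .grad holds
# another Tensor with dL/dP filled in by autograd during back-propagation.
class Tensor:
    def __init__(self, data, grad=None):
        self.data = data
        self.grad = grad

class Optimizer:
    """Base class: stores the parameters; subclasses implement step()."""
    def __init__(self, parameters):
        self.parameters = list(parameters)

    def step(self):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, parameters, lr=0.01):
        super().__init__(parameters)
        self.lr = lr

    def step(self):
        # P = P - learning_rate * dL/dP, applied to every parameter
        for param in self.parameters:
            param.data = param.data - self.lr * param.grad.data

p = Tensor(1.0, grad=Tensor(2.0))  # pretend autograd filled in dL/dP = 2.0
SGD([p], lr=0.1).step()
print(p.data)  # 1.0 - 0.1 * 2.0 = 0.8
```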
- Here's a list of (most of) the optimization algorithms used for neural networks: https://www.kdnuggets.com/2019/06/gradient-descent-algorithms-cheat-sheet.html. You can read the mathematical equations listed there, and if a more intuitive understanding is desired: YouTube!
- By now, it must be clear that not every optimizer is as simple as SGD. So how should they actually be implemented?
- A NOTE ABOUT THE NESTEROV ACCELERATED GRADIENT algorithm: most libraries implement this by storing the look-ahead weights, so that the update looks more like gradient descent; we can follow this route as well. If this didn't make sense: "Don't worry about it!"
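For what it's worth, here is one way the look-ahead formulation could be sketched. The update rule used below (v ← mu·v − lr·g; p ← p + mu·v − lr·g) is the common reformulation where the stored parameters are the look-ahead weights; the `Tensor` stub is a hypothetical stand-in, not the library's actual class:

```python
# Sketch of Nesterov accelerated gradient in its look-ahead form: because
# the stored parameters are the look-ahead weights, the update reads like
# plain gradient descent plus a momentum correction.
class Tensor:
    def __init__(self, data, grad=None):
        self.data = data
        self.grad = grad

class NesterovSGD:
    def __init__(self, parameters, lr=0.01, momentum=0.9):
        self.parameters = list(parameters)
        self.lr = lr
        self.momentum = momentum
        self.velocities = [0.0] * len(self.parameters)  # one velocity per parameter

    def step(self):
        for i, param in enumerate(self.parameters):
            g = param.grad.data  # gradient evaluated at the (stored) look-ahead weights
            # v <- mu * v - lr * g
            self.velocities[i] = self.momentum * self.velocities[i] - self.lr * g
            # p <- p + mu * v - lr * g  (gradient-descent-like update)
            param.data = param.data + self.momentum * self.velocities[i] - self.lr * g

p = Tensor(1.0, grad=Tensor(1.0))
NesterovSGD([p], lr=0.1, momentum=0.9).step()
print(p.data)  # ~0.81
```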
- The __init__(...) function:
  - Any hyperparameters an optimizer needs are accepted in __init__(...) and stored on the class instance as attributes, because we need them later to perform our update; similarly, for Adam we will be needing beta1, beta2 and epsilon.
  - State such as S_t and V_t is initialized in a similar fashion, to perform the desired update on the parameters.
- The step(...) function:
  - The update is performed directly on the data attribute of a Tensor. When we perform operations on Tensors, they get added to the computation graph or, to put it simply, these operations are recorded to be used during back-propagation. Since we are just updating the param, we don't want the update itself to be used during back-propagation, so we perform the operation directly on param.data and param.grad.data (param and param.grad are both Tensors).
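To tie the __init__/step split together, here is a hedged sketch of what Adam could look like under this scheme: beta1, beta2 and epsilon stored in __init__, per-parameter V (first moment) and S (second moment) state, and the update written straight to param.data. The `Tensor` stub is again just a stand-in for the real autograd Tensor:

```python
import math

# Sketch of Adam following the structure above: hyperparameters live on the
# instance, V and S are kept per parameter, and step() writes directly to
# param.data so the update is not recorded for back-propagation.
class Tensor:
    def __init__(self, data, grad=None):
        self.data = data
        self.grad = grad

class Adam:
    def __init__(self, parameters, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.parameters = list(parameters)
        self.lr = lr
        self.beta1, self.beta2, self.epsilon = beta1, beta2, epsilon
        self.V = [0.0] * len(self.parameters)  # moving average of gradients
        self.S = [0.0] * len(self.parameters)  # moving average of squared gradients
        self.t = 0                             # step counter for bias correction

    def step(self):
        self.t += 1
        for i, param in enumerate(self.parameters):
            g = param.grad.data
            self.V[i] = self.beta1 * self.V[i] + (1 - self.beta1) * g
            self.S[i] = self.beta2 * self.S[i] + (1 - self.beta2) * g * g
            # bias-corrected moment estimates
            v_hat = self.V[i] / (1 - self.beta1 ** self.t)
            s_hat = self.S[i] / (1 - self.beta2 ** self.t)
            # update the raw data so autograd does not record it
            param.data = param.data - self.lr * v_hat / (math.sqrt(s_hat) + self.epsilon)

p = Tensor(1.0, grad=Tensor(0.5))
Adam([p], lr=0.1).step()
print(p.data)  # first step moves by ~lr regardless of gradient scale: ~0.9
```

Note the design point the bullets make: all arithmetic goes through `param.data` and `param.grad.data`, never through the Tensors themselves, so the optimizer's bookkeeping never ends up in the computation graph.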
Additional context
Neural nets depend entirely on optimizers for learning; having various optimizers available at hand would be a great feature for the library.