Add the actor-critic algorithm

This can be done entirely inside the update function, with the help of an additional value function that learns to approximate the expected rewards (returns). First update the value function using a simple loss, then update the model.
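
A minimal sketch of what such an update could look like, assuming a tabular softmax policy, Monte Carlo returns as the value target, and a squared-error value loss. The names `update`, `policy_prefs`, `values`, and the learning-rate parameters are hypothetical and only for illustration; they are not taken from the existing code.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def update(policy_prefs, values, episode, gamma=0.99,
           value_lr=0.1, policy_lr=0.1):
    """One actor-critic update from a finished episode.

    policy_prefs: list of per-state lists of action preferences (the actor).
    values:       list of per-state value estimates (the critic).
    episode:      list of (state, action, reward) tuples.
    Both tables are modified in place. All names here are assumptions.
    """
    returns = discounted_returns([r for _, _, r in episode], gamma)

    # Step 1: update the value function first, using a simple squared-error
    # loss 0.5 * (G - V(s))^2, whose gradient step is V(s) += lr * (G - V(s)).
    for (state, _, _), g in zip(episode, returns):
        values[state] += value_lr * (g - values[state])

    # Step 2: update the policy, using the freshly updated value estimate
    # as a baseline. The advantage G - V(s) scales the log-probability
    # gradient of the chosen action under a softmax policy.
    for (state, action, _), g in zip(episode, returns):
        advantage = g - values[state]
        probs = softmax(policy_prefs[state])
        for a in range(len(probs)):
            indicator = 1.0 if a == action else 0.0
            policy_prefs[state][a] += policy_lr * advantage * (indicator - probs[a])


# Example: one state, two actions, a single rewarded step with action 1.
prefs = [[0.0, 0.0]]
vals = [0.0]
update(prefs, vals, [(0, 1, 1.0)])
```

Updating the value function before the policy follows the order given above; with a neural-network model the same two steps would typically become two losses and two optimizer steps, but that depends on how the existing update function is structured.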