Lunar Lander environment from OpenAI's Gym (https://gym.openai.com/envs/LunarLander-v2/) solved using three different algorithms: DQN, DDPG and PPO.
This project was developed for the course EL2805 Reinforcement Learning at KTH.
In this version of the problem we use a discrete action space with 4 actions (fire the left engine, fire the right engine, fire the main engine, or do nothing). Q-learning finds the optimal policy by learning the Q-function, also known as the (state, action)-value function. Q(s, a) returns the expected cumulative future reward obtained by taking action a in state s and acting optimally thereafter.
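As an illustration of the idea DQN builds on, here is a minimal sketch of the classic tabular Q-learning update. The variable names and hyperparameter values (`alpha`, `gamma`) are placeholders, not the ones used in this project.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Move Q[s, a] toward the bootstrapped target r + gamma * max_a' Q[s_next, a'].

    Q is a (n_states, n_actions) table; this only works when states are enumerable,
    which is why Lunar Lander needs a function approximator instead (see DQN below).
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```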
However, because the state space of the Lunar Lander environment is continuous (and therefore effectively infinite), learning Q(s, a) for every state s individually is not feasible. In DQN we instead approximate the Q-function with a neural network that takes the state as input and outputs a Q-value for each of the 4 actions. The parameters being learned are the weights and biases of the network rather than a table of Q-values for every state, which keeps the problem tractable.
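The sketch below shows what such a Q-network could look like in PyTorch. It is only illustrative: the layer sizes, activation functions and the framework itself are assumptions, not necessarily what this project uses.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps the 8-dimensional Lunar Lander state to one Q-value per discrete action."""
    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection: pick the action with the largest predicted Q-value.
# q_net = QNetwork()
# action = q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()
```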
We now switch to a version of the problem with a continuous action space, which is more realistic and is the usual scenario in most cyber-physical systems such as autonomous vehicles.
With a continuous action space we can no longer use DQN, since selecting the action that maximizes the Q-value would require solving an optimization problem over infinitely many actions at every step. The solution is to use an actor-critic algorithm, in which we estimate both the critic (the Q-function) and the actor (the policy). As in DQN, a neural network approximates the Q-function, but now a second network approximates a deterministic policy. This policy is used to generate the samples and is periodically updated using the Deterministic Policy Gradient method.
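A minimal sketch of the two networks in DDPG is shown below, assuming the 8-dimensional state and 2-dimensional action of LunarLanderContinuous-v2. Architectures and sizes are illustrative only, not the project's exact ones.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: state -> action in [-1, 1]^action_dim."""
    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q-function approximator: (state, action) -> scalar Q-value."""
    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # The critic sees the state and the actor's action concatenated together.
        return self.net(torch.cat([state, action], dim=-1))
```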
PPO is an attempt to improve the stability of policy gradient methods. It is also an actor-critic algorithm, and here we again work with a continuous action space.
PPO approximates a trust-region method: instead of a deterministic policy it learns a stochastic one, and each update is constrained to stay close to the previous policy. The policy network returns the mean and variance of a Gaussian distribution given the state, and actions are sampled from that distribution. To prevent sudden changes in the policy network, the update is bounded with a clipping function applied to the probability ratio between the new and the old policy.
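The clipped surrogate loss at the core of PPO can be sketched as below. The clipping threshold `eps` and the way log-probabilities and advantages are computed are placeholders, shown only to make the clipping idea concrete.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate objective: bounds how far the new policy can move
    from the old one in a single update."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum keeps the objective pessimistic,
    # so large policy changes are never rewarded.
    return -torch.min(unclipped, clipped).mean()
```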


