Playing Super Mario Bros with Proximal Policy Optimization (PPO)
My PyTorch Proximal Policy Optimization (PPO) implement to playing Super Mario Bros (This is PPO paper).
I tried implementing A2C to train the agent to play Super Mario Bros game. But A2C only completed 26/32 stages and I saw some other people completed 31/32 stages with PPO so I implemented PPO to play Mario. Implementing it myself also helps me better understand the algorithm.
You can use my notebook for training and testing agent very easy:
- Train your model by running all cell before session test
- Test your trained model by running all cell except agent.train(), just pass your model path to agent.load_model(model_path)
Or you can use train.py and test.py if you don't want to use notebook:
- Train your model by running train.py: For example training for stage 1-4: python train.py --world 1 --stage 4 --num_envs 8
- Test your trained model by running test.py: For example testing for stage 1-4: python test.py --world 1 --stage 4 --pretrained_model best_model.pth --num_envs 2
You can find trained model in folder trained_model
Default parameter set:
- Because I adjust the parameters for each stage, the optimal set of parameters that I recommend below is different from the parameters I train for the stages in the table below.
- I don't have enough time and resources to test again, however I ran 2 stages such as 1-1, 1-4 and found that the default parameters I recommended were good enough.
- RL is very sensitive to hyper parameters, so difficult stages like 5-3, 8-1 will require special parameter sets, the default parameters are for reference only.
- num_envs: 8 (only need 16 for hard stages)
- learn_step: 512 (to reach full episode)
- batchsize: 64 (Best when experimenting)
- epoch: 10 (not effect)
- gamma: 0.9 (only need 0.99 for special stages)
- learning_rate: 7e-5 (not effect too much, batchsize and learning_rate interact with each other like any other type of ML. However, when testing, I found batchsize had a stronger impact and a few tests of increasing or decreasing learning showed no difference.)
- target_kl: 0.05 (Best when experimenting)
- norm_adv: False (Best when experimenting, norm advantage helps the agent learn faster, but I feel like it makes the model worse in difficult stages.)
- loss_type: mse (Huber is needed for some difficult stages like 5-3, 8-1 but for many stages, mse or huber is not much different. I think loss type also interacts with other hyperparameters, you can experiment further.)
- gae_lambda: 0.95 (I don't tuning this parameter)
How did I find the hyperparameters for each stage:
- First, I combined the default parameters from my PPO stable baselines 3 and A2C: stage 1-1 hyperparameters (I found num_envs to be 8 good enough for PPO while A2C requires 16 to be good)
- Every time training fails, I adjust the parameters, initially I just adjust them randomly: learning_rate between 1e-5 and 1e-3, use norm advantage or not, gamma 0.9 or 0.99, learn step is 256 or 512 , batch size is 64 or 256.
- Through many stages, I recognized important parameters and understood that some stages needed to adjust a few parameters.
- learn_step should always be 512 because it ensures the agent will see the entire episode (256 is shorter than the number of epsiodes) especially for long episodes (I choose 512 as default).
- norm_adv is usually not useful, sometimes it helps train faster (by default I won't need norm_adv)
- loss_type is mse would be better, I find that unlike DQL, value network does not need stability from huber, mse is enough (default will be mse). However, huber is needed for some stages like 5-3, I'm not sure about the interaction between huber loss, mse loss and other hyper parameters!
- gamma is 0.9 for easy stages, when training fails with 0.9, try with 0.99 (harder stages often need gamma = 0.99)
- learning_rate does not affect too much (should be set to 7e-5)
- epoch have no effect
- target_kl is importance to make agent learn stablize. But I can completed 30/31 stages (except 5-3) without target_kl. You can see that 30 stages used target_kl is None. But I can't tuning hyperparameters to complete stage 5-3. Than I use target_kl and find that with target_kl = 0.05, I can completed stage 5-3. A higher target_kl does not work because it is not strong enough to make the model stable. A smaller target_kl causes the model to learn very slowly, or even not learn at all. Of course, this parameter needs to be further tested and it is also possible that this parameter depends on other hyper parameters or depends on each specific stage.
| World | Stage | num_envs | learn_step | batchsize | epoch | gamma | learning_rate | target_kl | norm_adv | loss_type | training_step | training_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 55975 | 0:37:39 |
| 1 | 2 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 387965 | 3:00:54 |
| 1 | 3 | 16 | 512 | 64 | 10 | 0.99 | 1e-4 | None | False | mse | 262984 | 5:08:55 |
| 1 | 4 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 19969 | 0:14:20 |
| 2 | 1 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 1220983 | 7:39:52 |
| 2 | 2 | 8 | 512 | 256 | 10 | 0.9 | 1e-4 | None | False | huber | 1311983 | 11:23:30 |
| 2 | 3 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 103997 | 1:01:56 |
| 2 | 4 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 264986 | 2:11:13 |
| 3 | 1 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | False | huber | 540992 | 5:39:07 |
| 3 | 2 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 59981 | 0:41:55 |
| 3 | 3 | 16 | 512 | 256 | 10 | 0.99 | 1e-4 | None | False | huber | 65994 | 0:44:40 |
| 3 | 4 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 28992 | 0:21:35 |
| 4 | 1 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 84996 | 0:53:47 |
| 4 | 2 | 16 | 512 | 64 | 10 | 0.99 | 7e-5 | None | False | mse | 390654 | 9:42:05 |
| 4 | 3 | 16 | 512 | 64 | 10 | 0.99 | 7e-5 | None | False | mse | 73968 | 1:27:08 |
| 4 | 4 | 16 | 512 | 256 | 10 | 0.99 | 1e-4 | None | True | mse | 227983 | 4:17:04 |
| 5 | 1 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 111944 | 1:22:04 |
| 5 | 2 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | False | huber | 468979 | 4:54:56 |
| 5 | 3 | 16 | 512 | 64 | 10 | 0.99 | 7e-5 | 0.05 | False | huber | 613996 | 7:56:56 |
| 5 | 4 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 648972 | 6:05:24 |
| 6 | 1 | 8 | 512 | 256 | 10 | 0.9 | 1e-4 | None | False | huber | 159955 | 1:14:31 |
| 6 | 2 | 8 | 512 | 256 | 10 | 0.9 | 1e-4 | None | False | huber | 1165994 | 10:29:06 |
| 6 | 3 | 16 | 512 | 64 | 10 | 0.99 | 7e-5 | None | False | mse | 151961 | 2:49:34 |
| 6 | 4 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 94996 | 0:53:33 |
| 7 | 1 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 280000 | 2:44:50 |
| 7 | 2 | 8 | 512 | 256 | 10 | 0.9 | 1e-4 | None | False | huber | 2951967 | 1 day, 1:14:32 |
| 7 | 3 | 8 | 256 | 64 | 10 | 0.9 | 1e-4 | None | True | huber | 491992 | 4:37:12 |
| 7 | 4 | 16 | 512 | 256 | 10 | 0.9 | 1e-4 | None | False | mse | 169994 | 3:16:36 |
| 8 | 1 | 16 | 512 | 256 | 10 | 0.9 | 1e-4 | None | False | huber | 1450982 | 13:09:19 |
| 8 | 2 | 16 | 512 | 256 | 10 | 0.9 | 1e-4 | None | True | huber | 699985 | 7:44:43 |
| 8 | 3 | 16 | 512 | 256 | 10 | 0.9 | 1e-4 | None | True | huber | 1964979 | 19:12:40 |
-
Is this code guaranteed to complete the stages if you try training?
- This hyperparameter does not guarantee you will complete the stage. But I am sure that you can win with this hyperparameter except you have a unlucky day (need 2-3 times to win because of randomness)
-
How long do you train agents?
- Within a few hours to more than 1 day. Time depends on hardware, I use many different hardware so time will not be accurate.
-
How can you improve this code?
- You can separate the test agent part into a separate thread or process. I'm not good at multi-threaded programming so I don't do this.
-
Compare with A2C?
- It can be clearly seen that PPO is better, it can complete more difficult stages and learn how to go left to win (stages 4-4).
- Training time with PPO is generally significantly reduced because PPO is a more powerful algorithm, can work with 8 environments and usually does not reduce performance.
- PPO also has more hyperparameters so it requires more tuning or experience compared to A2C.
-
Why still not complete stage 8-4:
- First, stage 8-4 has a loop so it requires custom environment like stages 4-4 and 7-4.
- second, stage 8-4 is very long, it is the longest stage.
- Third, stage 8-4 requires finding hidden brick to get on the right path. It requires an algorithm to help explore the environment better.
- python 3>3.6
- gym==0.25.2
- gym-super-mario-bros==7.4.0
- imageio
- imageio-ffmpeg
- cv2
- pytorch
- numpy
With my code, I can completed 31/32 stages of Super Mario Bros. The lastest stage (8-4) is very hard, it is longest stage and agent need to find hidden brick to win than ppo can't complete this.































