Mario_PPO

Playing Super Mario Bros with Proximal Policy Optimization (PPO)

Introduction

My PyTorch implementation of Proximal Policy Optimization (PPO) for playing Super Mario Bros (see the PPO paper).

Results

Motivation

I first tried implementing A2C to train an agent to play Super Mario Bros, but A2C only completed 26/32 stages. I saw that other people had completed 31/32 stages with PPO, so I implemented PPO to play Mario. Implementing it myself also helped me understand the algorithm better.

How to use it

You can use my notebook to train and test the agent easily:

Train your model by running all cells before the test section.
Test your trained model by running all cells except agent.train(); just pass your model path to agent.load_model(model_path).

Or you can use train.py and test.py if you don't want to use the notebook:

Train your model by running train.py. For example, to train on stage 1-4: python train.py --world 1 --stage 4 --num_envs 8
Test your trained model by running test.py. For example, to test on stage 1-4: python test.py --world 1 --stage 4 --pretrained_model best_model.pth --num_envs 2
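
The flags in the two commands above could be parsed along the following lines. This is only a sketch: build_parser is a hypothetical helper, the real train.py and test.py may define more options or different types, and every default below is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Train or test a PPO agent on one Super Mario Bros stage"
    )
    parser.add_argument("--world", type=int, default=1, help="world number, 1 to 8")
    parser.add_argument("--stage", type=int, default=1, help="stage inside the world, 1 to 4")
    parser.add_argument("--num_envs", type=int, default=8, help="number of parallel environments")
    # Only the test command is shown with this flag; the default of None is an assumption.
    parser.add_argument("--pretrained_model", type=str, default=None, help="path to a saved model")
    return parser


# Equivalent of: python test.py --world 1 --stage 4 --pretrained_model best_model.pth --num_envs 2
args = build_parser().parse_args(
    ["--world", "1", "--stage", "4", "--pretrained_model", "best_model.pth", "--num_envs", "2"]
)
print(args.world, args.stage, args.pretrained_model, args.num_envs)
```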

Trained models

You can find the trained models in the trained_model folder.

Hyperparameters

Default parameter set:

Because I adjusted the parameters for each stage, the default set that I recommend below differs from the parameters I actually used to train the stages in the table further down.
I don't have enough time and resources to retest every stage; however, I ran two stages (1-1 and 1-4) and found that the recommended defaults were good enough.
RL is very sensitive to hyperparameters, so difficult stages like 5-3 and 8-1 will require special parameter sets; the default parameters are for reference only.
num_envs: 8 (16 is only needed for hard stages)
learn_step: 512 (long enough to cover a full episode)
batchsize: 64 (best value in my experiments)
epoch: 10 (no noticeable effect)
gamma: 0.9 (0.99 is only needed for special stages)
learning_rate: 7e-5 (little effect. batchsize and learning_rate interact with each other as in any other type of ML; however, in my tests batchsize had the stronger impact, and a few runs with a higher or lower learning rate showed no difference.)
target_kl: 0.05 (best value in my experiments)
norm_adv: False (best value in my experiments. Normalizing the advantage helps the agent learn faster, but I feel it makes the model worse on difficult stages.)
loss_type: mse (huber is needed for some difficult stages like 5-3 and 8-1, but for many stages mse and huber are not much different. I think the loss type also interacts with other hyperparameters; you can experiment further.)
gae_lambda: 0.95 (I did not tune this parameter)
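
To make the scale of these defaults concrete, here is a small sketch that collects them in a dataclass and derives a few quantities. PPOConfig and its properties are illustrative names, not part of this repository. The derived numbers assume that learn_step counts steps per environment and that every collected sample is used once per epoch, which is the usual PPO layout but is not confirmed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PPOConfig:
    """Default values copied from the list above; the class itself is illustrative."""
    num_envs: int = 8
    learn_step: int = 512
    batchsize: int = 64
    epoch: int = 10
    gamma: float = 0.9
    learning_rate: float = 7e-5
    target_kl: Optional[float] = 0.05
    norm_adv: bool = False
    loss_type: str = "mse"
    gae_lambda: float = 0.95

    @property
    def rollout_size(self) -> int:
        # Assumption: one transition per environment per step before each update.
        return self.num_envs * self.learn_step

    @property
    def minibatches_per_epoch(self) -> int:
        return self.rollout_size // self.batchsize

    @property
    def max_gradient_steps_per_update(self) -> int:
        # Upper bound only: an early stop on target_kl can end the update sooner.
        return self.epoch * self.minibatches_per_epoch

    @property
    def effective_horizon(self) -> int:
        # Rule of thumb: rewards about 1 / (1 - gamma) steps ahead still carry weight.
        return round(1.0 / (1.0 - self.gamma))


default = PPOConfig()
hard = PPOConfig(num_envs=16, gamma=0.99, loss_type="huber")
print(default.rollout_size, default.minibatches_per_epoch, default.max_gradient_steps_per_update)
# 4096 64 640
print(default.effective_horizon, hard.effective_horizon)
# 10 100
```

Under these assumptions, moving gamma from 0.9 to 0.99 stretches the rough planning horizon from about 10 steps to about 100, which is one plausible reason harder stages need the larger value.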

How did I find the hyperparameters for each stage:

First, I combined the default parameters from my Stable Baselines 3 PPO project and my A2C project to get the stage 1-1 hyperparameters (I found num_envs = 8 good enough for PPO, while A2C needs 16 to work well).
Every time training failed, I adjusted the parameters. Initially I adjusted them randomly: learning_rate between 1e-5 and 1e-3, norm_adv on or off, gamma 0.9 or 0.99, learn_step 256 or 512, batchsize 64 or 256.
After many stages, I recognized which parameters matter and understood that some stages need only a few parameters adjusted.
learn_step should always be 512 because it ensures the agent sees the entire episode (256 is shorter than the number of steps in an episode), especially for long episodes (I chose 512 as the default).
norm_adv is usually not useful; sometimes it helps training go faster (by default I don't use norm_adv).
loss_type mse is usually better. I find that, unlike DQL, the value network does not need the stability of huber; mse is enough (the default is mse). However, huber is needed for some stages like 5-3, and I'm not sure how the choice between huber and mse interacts with the other hyperparameters.
gamma is 0.9 for easy stages; when training fails with 0.9, try 0.99 (harder stages often need gamma = 0.99).
learning_rate does not matter much (I suggest 7e-5).
epoch has no noticeable effect.
target_kl is important for stable learning. I completed 30 of the 31 stages (all except 5-3) without target_kl; you can see in the table that those 30 stages use target_kl = None. However, I couldn't tune the other hyperparameters to complete stage 5-3. Then I tried target_kl and found that with target_kl = 0.05 I could complete stage 5-3. A higher target_kl does not work because the constraint is too weak to keep the model stable. A smaller target_kl makes the model learn very slowly, or not at all. Of course, this parameter needs further testing, and it may also depend on other hyperparameters or on the specific stage.
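
The role of target_kl can be sketched as an early-stopping check inside the update loop. The code below is a minimal pure-Python illustration, not this repository's implementation: approx_kl and should_stop are hypothetical helpers, the estimator (the mean of (r - 1) - log r over sampled actions) is one common choice, and the exact stopping rule used here may differ.

```python
import math
from typing import Optional, Sequence


def approx_kl(old_log_probs: Sequence[float], new_log_probs: Sequence[float]) -> float:
    """Sample estimate of KL(old || new): mean of (r - 1) - log r, with r = new / old."""
    total = 0.0
    for old_lp, new_lp in zip(old_log_probs, new_log_probs):
        log_ratio = new_lp - old_lp
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_log_probs)


def should_stop(kl: float, target_kl: Optional[float]) -> bool:
    """target_kl=None disables the check, as in the stages trained without it."""
    return target_kl is not None and kl > target_kl


# Log-probabilities of four sampled actions under the policy that collected the data.
old = [math.log(0.5)] * 4
# The same actions under the updated policy after a gentle and after an aggressive update.
small_drift = [math.log(0.52), math.log(0.48), math.log(0.50), math.log(0.51)]
large_drift = [math.log(0.90), math.log(0.10), math.log(0.80), math.log(0.20)]

for epoch, new in enumerate([small_drift, large_drift], start=1):
    kl = approx_kl(old, new)
    if should_stop(kl, target_kl=0.05):
        print(f"stopping early at epoch {epoch}")  # prints for epoch 2 only
        break
```

In this sketch a larger target_kl lets the aggressive update through, while a very small one would stop after almost every minibatch, which matches the behaviour described above.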

World	Stage	num_envs	learn_step	batchsize	epoch	gamma	learning_rate	target_kl	norm_adv	loss_type	training_step	training_time
1	1	8	256	64	10	0.9	1e-4	None	True	huber	55975	0:37:39
1	2	8	256	64	10	0.9	1e-4	None	True	huber	387965	3:00:54
1	3	16	512	64	10	0.99	1e-4	None	False	mse	262984	5:08:55
1	4	8	256	64	10	0.9	1e-4	None	True	huber	19969	0:14:20
2	1	8	256	64	10	0.9	1e-4	None	True	huber	1220983	7:39:52
2	2	8	512	256	10	0.9	1e-4	None	False	huber	1311983	11:23:30
2	3	8	256	64	10	0.9	1e-4	None	True	huber	103997	1:01:56
2	4	8	256	64	10	0.9	1e-4	None	True	huber	264986	2:11:13
3	1	8	256	64	10	0.9	1e-4	None	False	huber	540992	5:39:07
3	2	8	256	64	10	0.9	1e-4	None	True	huber	59981	0:41:55
3	3	16	512	256	10	0.99	1e-4	None	False	huber	65994	0:44:40
3	4	8	256	64	10	0.9	1e-4	None	True	huber	28992	0:21:35
4	1	8	256	64	10	0.9	1e-4	None	True	huber	84996	0:53:47
4	2	16	512	64	10	0.99	7e-5	None	False	mse	390654	9:42:05
4	3	16	512	64	10	0.99	7e-5	None	False	mse	73968	1:27:08
4	4	16	512	256	10	0.99	1e-4	None	True	mse	227983	4:17:04
5	1	8	256	64	10	0.9	1e-4	None	True	huber	111944	1:22:04
5	2	8	256	64	10	0.9	1e-4	None	False	huber	468979	4:54:56
5	3	16	512	64	10	0.99	7e-5	0.05	False	huber	613996	7:56:56
5	4	8	256	64	10	0.9	1e-4	None	True	huber	648972	6:05:24
6	1	8	512	256	10	0.9	1e-4	None	False	huber	159955	1:14:31
6	2	8	512	256	10	0.9	1e-4	None	False	huber	1165994	10:29:06
6	3	16	512	64	10	0.99	7e-5	None	False	mse	151961	2:49:34
6	4	8	256	64	10	0.9	1e-4	None	True	huber	94996	0:53:33
7	1	8	256	64	10	0.9	1e-4	None	True	huber	280000	2:44:50
7	2	8	512	256	10	0.9	1e-4	None	False	huber	2951967	1 day, 1:14:32
7	3	8	256	64	10	0.9	1e-4	None	True	huber	491992	4:37:12
7	4	16	512	256	10	0.9	1e-4	None	False	mse	169994	3:16:36
8	1	16	512	256	10	0.9	1e-4	None	False	huber	1450982	13:09:19
8	2	16	512	256	10	0.9	1e-4	None	True	huber	699985	7:44:43
8	3	16	512	256	10	0.9	1e-4	None	True	huber	1964979	19:12:40

Questions

Is this code guaranteed to complete the stages if you try training?
- These hyperparameters do not guarantee that you will complete the stage. But I am sure you can win with them unless you have an unlucky day (you may need 2-3 runs to win because of randomness).
How long do you train agents?
- From a few hours to more than 1 day. Time depends on hardware; I used several different machines, so the training times are only approximate.
How can you improve this code?
- You can move the test agent into a separate thread or process. I'm not good at multi-threaded programming, so I didn't do this.
How does it compare with A2C?
- PPO is clearly better: it can complete more difficult stages and learns to go left to win (stage 4-4).
- Training time with PPO is generally much shorter because PPO is a more powerful algorithm; it can work with 8 environments, usually without losing performance.
- PPO also has more hyperparameters, so it requires more tuning or experience than A2C.
Why is stage 8-4 still not completed?
- First, stage 8-4 has a loop, so it requires a custom environment, like stages 4-4 and 7-4.
- Second, stage 8-4 is very long; it is the longest stage.
- Third, stage 8-4 requires finding a hidden brick to get on the right path. It needs an algorithm that explores the environment better.
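
The suggestion above about moving the test agent out of the training loop could look roughly like this. It is a sketch under assumptions: evaluation_worker and fake_evaluate are hypothetical names, and fake_evaluate stands in for loading a saved model and playing one test episode.

```python
import queue
import threading
from typing import Callable, List, Optional, Tuple


def evaluation_worker(
    jobs: "queue.Queue[Optional[str]]",
    evaluate: Callable[[str], bool],
    results: List[Tuple[str, bool]],
) -> None:
    """Evaluate checkpoints as they arrive; a None job shuts the worker down."""
    while True:
        model_path = jobs.get()
        if model_path is None:
            break
        results.append((model_path, evaluate(model_path)))


def fake_evaluate(model_path: str) -> bool:
    # Stand-in for loading the model and checking whether it completes the stage.
    return model_path.endswith("best_model.pth")


jobs: "queue.Queue[Optional[str]]" = queue.Queue()
results: List[Tuple[str, bool]] = []
worker = threading.Thread(
    target=evaluation_worker, args=(jobs, fake_evaluate, results), daemon=True
)
worker.start()

# The training loop would enqueue a checkpoint path and carry on training.
for path in ["step_1000.pth", "best_model.pth"]:
    jobs.put(path)

jobs.put(None)  # tell the worker to stop
worker.join()
print(results)
# [('step_1000.pth', False), ('best_model.pth', True)]
```

A single worker reading from a FIFO queue keeps results in submission order. If evaluation turns out to be CPU-bound, the same queue pattern can be moved to a separate process with the multiprocessing module.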

Requirements

python > 3.6
gym==0.25.2
gym-super-mario-bros==7.4.0
imageio
imageio-ffmpeg
opencv-python (imported as cv2)
pytorch
numpy

Acknowledgements

With my code, I completed 31/32 stages of Super Mario Bros. The last stage (8-4) is very hard: it is the longest stage and the agent needs to find a hidden brick to win, so PPO can't complete it.
