diff --git a/__pycache__/share_func.cpython-310.pyc b/__pycache__/share_func.cpython-310.pyc
new file mode 100644
index 0000000..815c6e2
Binary files /dev/null and b/__pycache__/share_func.cpython-310.pyc differ
diff --git a/dqn/dqn.md b/dqn/dqn.md
new file mode 100644
index 0000000..e69de29
diff --git a/dqn/dqn.py b/dqn/dqn.py
new file mode 100644
index 0000000..19ee897
--- /dev/null
+++ b/dqn/dqn.py
@@ -0,0 +1,9 @@
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+'''
+@File    : dqn.py
+@Time    : 2024/09/20 10:01:00
+@Author  : junewluo
+@Email   : overtheriver861@gmail.com
+@description : implementation of the DQN framework.
+'''
diff --git a/ppo/__pycache__/ppo.cpython-310.pyc b/ppo/__pycache__/ppo.cpython-310.pyc
new file mode 100644
index 0000000..d3d954b
Binary files /dev/null and b/ppo/__pycache__/ppo.cpython-310.pyc differ
diff --git a/ppo/__pycache__/relaybuffer.cpython-310.pyc b/ppo/__pycache__/relaybuffer.cpython-310.pyc
new file mode 100644
index 0000000..a44faf4
Binary files /dev/null and b/ppo/__pycache__/relaybuffer.cpython-310.pyc differ
diff --git a/ppo/__pycache__/trick.cpython-310.pyc b/ppo/__pycache__/trick.cpython-310.pyc
new file mode 100644
index 0000000..2fce3ab
Binary files /dev/null and b/ppo/__pycache__/trick.cpython-310.pyc differ
diff --git a/ppo/imgs/clip_func.png b/ppo/imgs/clip_func.png
new file mode 100644
index 0000000..52c3fc5
Binary files /dev/null and b/ppo/imgs/clip_func.png differ
diff --git a/ppo/imgs/clip_func_range.png b/ppo/imgs/clip_func_range.png
new file mode 100644
index 0000000..bde67d3
Binary files /dev/null and b/ppo/imgs/clip_func_range.png differ
diff --git a/ppo/ppo.md b/ppo/ppo.md
index bd67768..962a76e 100644
--- a/ppo/ppo.md
+++ b/ppo/ppo.md
@@ -3,7 +3,6 @@
 If the agent being trained and the agent that interacts with the environment (generating the training samples) are the same, the method is called on-policy.
 If the agent being trained and the agent that interacts with the environment (generating the training samples) are not the same, the method is called off-policy.
-https://blog.csdn.net/qq_33302004/article/details/115666895
 ## 2. Importance Sampling
 Why do we need importance sampling at all? The answer can be seen from the gradient formula of the PG algorithm:
 $$
@@ -91,9 +90,50 @@
 - Update $\theta$ and $\phi$ to minimize the total loss $L_{total} = L_{surr}(\theta) - c_1 L_{vf}(\phi) + c_2 S[\pi_{\theta}](s_t) + ...$ (where $S$ is the entropy regularization term)
 - Update $\theta_{old} \leftarrow \theta$
-## 3.3 PPO2
-
-
-
-
-
+### 3.3 PPO2
+PPO2, also known as PPO-Clip, drops the KL-divergence constraint and instead builds the constraint directly into the objective function:
+$$
+J^{\theta'}(\theta) = \sum_{(s_t,a_t)} \min\left( \frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)} A^{\theta'}(s_t,a_t),\; \mathrm{clip}\left( \frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon \right) A^{\theta'}(s_t,a_t) \right)
+$$
+In the expression above, clip is a truncation function: it restricts the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$ to the interval $[1-\epsilon, 1+\epsilon]$ (the clip function is illustrated in Figure 1). The PPO2 objective aims to increase the expected cumulative return while keeping $p_\theta(a_t|s_t)$ and $p_{\theta'}(a_t|s_t)$ from drifting too far apart. Concretely, the objective works as follows:
+
+- $A$ is the advantage. When $A>0$ the action is better than average, so we want to increase $p_\theta(a_t|s_t)$; when $A<0$ the action is a poor choice, so we want to decrease $p_\theta(a_t|s_t)$. However, increasing $p_\theta(a_t|s_t)$ when $A>0$ is constrained by importance sampling: $p_\theta(a_t|s_t)$ must not move too far from $p_{\theta'}(a_t|s_t)$. So once the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$ grows large enough, it must be capped, otherwise the two distributions will diverge. In PPO2 this cap is $1+\epsilon$: once the ratio exceeds this upper threshold, the objective gains nothing further from increasing $p_\theta(a_t|s_t)$. Symmetrically, when $A<0$ we want to decrease $p_\theta(a_t|s_t)$, but not by too much, so a lower bound of $1-\epsilon$ is set: once the ratio falls below it, the objective likewise gains nothing further. In this way the objective pursues the best expected return while still satisfying the conditions that importance sampling requires.
+
+
+
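The clipped surrogate objective that this hunk describes can be sketched in a few lines of NumPy. This is a minimal illustration written for this review, not code from the diff; the function name `ppo_clip_objective` and its arguments are my own, and a real training loop would negate this value to use it as a loss for gradient ascent.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO2 (PPO-Clip) surrogate objective, element-wise per sample.

    ratio     : p_theta(a|s) / p_theta'(a|s) for each sampled (s, a)
    advantage : advantage estimate A(s, a) under the old policy
    eps       : clip range epsilon
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # The element-wise min removes any extra gain once the ratio leaves
    # [1 - eps, 1 + eps] in the direction the advantage favors.
    return np.minimum(ratio * advantage, clipped * advantage)

# With A > 0, a ratio above 1 + eps earns no additional objective value:
obj = ppo_clip_objective(np.array([0.5, 1.0, 2.0]), np.array([1.0, 1.0, 1.0]), eps=0.2)
print(obj)  # [0.5 1.  1.2]
```

Note how the bullet point's two cases fall out of the single `min`: for $A>0$ the clipped term caps the upside at $(1+\epsilon)A$, while for $A<0$ the unclipped term keeps the full penalty when the ratio moves outside the trust region.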