Binary file added __pycache__/share_func.cpython-310.pyc
Empty file added dqn/dqn.md
9 changes: 9 additions & 0 deletions dqn/dqn.py
@@ -0,0 +1,9 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File : dqn.py
@Time : 2024/09/20 10:01:00
@Author : junewluo
@Email : overtheriver861@gmail.com
@description : Implementation of the DQN framework.
'''
Binary file added ppo/__pycache__/ppo.cpython-310.pyc
Binary file added ppo/__pycache__/relaybuffer.cpython-310.pyc
Binary file added ppo/__pycache__/trick.cpython-310.pyc
Binary file added ppo/imgs/clip_func.png
Binary file added ppo/imgs/clip_func_range.png
54 changes: 47 additions & 7 deletions ppo/ppo.md
@@ -3,7 +3,6 @@
If the agent being trained and the agent interacting with the environment (i.e., generating the training samples) are the same, the algorithm is on-policy. If they are different agents, the algorithm is off-policy.

https://blog.csdn.net/qq_33302004/article/details/115666895
## 2. Importance Sampling
Why do we need importance sampling? The reason is visible in the gradient formula of the PG (Policy Gradient) algorithm:
$$
@@ -91,9 +90,50 @@ $$
- Update $\theta$ and $\phi$ to minimize the total loss $L_{total} = L_{surr}(\theta) - c_1 L_{vf}(\phi) + c_2 S[\pi_{\theta}](s_t) + ...$ (where $S$ is the entropy regularization term)
- Update $\theta_{old} \leftarrow \theta$
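
The combined update step above can be sketched in one line. This is a minimal illustration, not the repository's code; the function name and the coefficient values `c1`, `c2` are assumptions, and the signs follow the $L_{total}$ formula exactly as written above:

```python
def total_loss(surrogate, value_loss, entropy, c1=0.5, c2=0.01):
    # L_total = L_surr - c1 * L_vf + c2 * S, per the update step above
    # (c1, c2 are illustrative coefficient choices)
    return surrogate - c1 * value_loss + c2 * entropy
```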

### 3.3 PPO2
PPO2 is also known as PPO-Clip. Instead of constraining the policy with a KL-divergence penalty, it builds the constraint directly into the objective function, which is defined as:
$$
J^{\theta^{'}}(\theta) = \sum_{(s_t,a_t)} \min( \frac{p_\theta(a_t|s_t)}{p_{\theta^{'}}(a_t|s_t)} A^{\theta^{'}}(s_t,a_t), \gamma A^{\theta^{'}}(s_t,a_t) )
\qquad{} (\gamma = clip(\frac{p_\theta(a_t|s_t)}{p_{\theta^{'}}(a_t|s_t)}, 1 - \epsilon, 1 + \epsilon))
$$
In this expression, clip is a clipping function: it restricts the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta^{'}}(a_t|s_t)}$ to the interval $[1-\epsilon, 1+\epsilon]$, as illustrated in Figure 1. The PPO2 objective tries to increase the expected cumulative return while preventing $p_\theta(a_t|s_t)$ and $p_{\theta^{'}}(a_t|s_t)$ from drifting too far apart. Concretely, the objective works as follows:

- $A$ is the advantage. When $A>0$ the action is better than average and we want to increase $p_\theta(a_t|s_t)$; when $A<0$ the action is poor and we want to decrease $p_\theta(a_t|s_t)$. However, when $A>0$, importance sampling constrains how far $p_\theta(a_t|s_t)$ may move away from $p_{\theta^{'}}(a_t|s_t)$. Once the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta^{'}}(a_t|s_t)}$ grows beyond a certain point, it must be capped, or the two distributions will diverge too much. In PPO2 this cap is $1+\epsilon$: when the ratio exceeds this threshold, the objective no longer rewards further increases in $p_\theta(a_t|s_t)$. Symmetrically, when $A<0$ we want to reduce $p_\theta(a_t|s_t)$, but not by too much, so a floor is needed; in PPO2 this is $1-\epsilon$, and once the ratio falls below it, the objective yields no further gain. In this way the method pursues the best expected return while still satisfying the requirements of importance sampling.
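
The clipping behavior described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed names (`ppo_clip_objective`, `eps=0.2`), not the repository's implementation; note how the result stops improving once the ratio leaves $[1-\epsilon, 1+\epsilon]$:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO2 (PPO-Clip) objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

ratios = np.array([0.5, 1.0, 1.5])
# A > 0: gains are capped at ratio 1 + eps
print(ppo_clip_objective(ratios, np.array([1.0, 1.0, 1.0])))    # → [0.5 1.  1.2]
# A < 0: no extra gain for pushing the ratio below 1 - eps
print(ppo_clip_objective(ratios, np.array([-1.0, -1.0, -1.0]))) # → [-0.8 -1.  -1.5]
```

With $A>0$ the ratio 1.5 is clipped to 1.2; with $A<0$ the minimum keeps the pessimistic value, so the objective never rewards an overly large policy shift.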

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src = "imgs/clip_func.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
图1">
padding: 2px;">Figure 1 - Illustration of the clip function</div>
</center>

Having understood the clip function, we can analyze the PPO2 objective with the help of Figure 2, in which $\frac{p_\theta(a_t|s_t)}{p_{\theta^{'}}(a_t|s_t)}$ is plotted as an increasing function. Observe that:

- the green curve is $\frac{p_\theta(a_t|s_t)}{p_{\theta^{'}}(a_t|s_t)}$
- the blue curve is the clip function
- the red curve is the minimum selected by the $\min$ in the PPO2 objective

Between the green and the blue curve, we take whichever is smaller. If the advantage term $A$ multiplying them is positive, taking the minimum yields the red line in the left panel; likewise, if $A$ is negative, the minimum yields the red line in the right panel.

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src = "imgs/clip_func_range.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
图2">
padding: 2px;">Figure 2 - Optimization of the PPO2 objective</div>
</center>


## 4. Implementation of PPO2
This section implements PPO2 in an Actor-Critic style. From the PPO2 objective in Section 3, several key components are needed:

- the advantage function
-
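
As a starting point for the advantage function listed above, here is a sketch of Generalized Advantage Estimation (GAE), a common choice in PPO implementations. The function name, signature, and default coefficients are illustrative assumptions, not the repository's actual code:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE: A_t = sum_k (gamma*lam)^k * delta_{t+k}, computed backwards in time."""
    advantages = np.zeros_like(rewards)
    gae = 0.0
    next_value = 0.0  # assume the trajectory ends after the last step
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]                                   # zero out across episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]  # TD error
        gae = delta + gamma * lam * mask * gae                  # discounted running sum
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

In practice the advantages are often normalized per batch before being plugged into the clipped surrogate objective.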
29 changes: 0 additions & 29 deletions ppo/ppo.py
@@ -16,35 +16,6 @@
from torch.distributions import Categorical

""" PPO Algorithm
PPO is built on an Actor-Critic architecture; its code layout is very similar to A2C.

1. The classic A2C structure:
    ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ backward ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅
    ⬇ ⬆
    State & (R, new_State) ----> Critic ----> value -- subtract the true value reality_value --> td_e = reality_v - v
    State ----> Actor ---Policy--> P(s,a) ➡➡➡➡➡➡➡➡➡➡➡ sum the two ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅
    ⬇ ⬇
    ⬇ ⬇
    ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ actor_loss = Log[P(s,a)] + td_e

** Actor Section **: the Critic's output td_e is combined with the log of the Actor's own output P(s,a) to form the Actor loss, which is then backpropagated.
** Critic Section **: the Critic receives the current State, plus the reward R and the new state new_State returned by env.step(action).

2. The Actor-Critic framework as actually implemented in PPO:
    ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ backward ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅
    ⬇ ⬆
    State & (r, new_State) ➡➡➡ Critic ----> value -- subtract the true value reality_value --> td_e = reality_v - v
    State ➡➡➡ Actor[old_policy] ➡➡➡ P_old(s,a) ➡➡➡➡➡➡➡➡➡➡ ratio = P_new(s,a) / P_old(s,a) ➡➡ compute the loss via Eq. (1) ➡➡➡➡ loss
    ⬇ ⬆ ⬇
    ⬇ ⬆ [divide the two to get the importance weight] ⬇
    ⬇ ⬆ ⬇
    ⬇ ➡➡➡➡ Actor[new_policy] ➡➡➡ P_new(s,a) ➡➡➡➡➡➡➡➡➡ ⬆ ⬇
    ⬆ ⬇
    ⬆ ⬇
    ⬆⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ backward ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬇
Some PPO tricks will also be implemented:

"""
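
The actor-update path in the second diagram (ratio of new to old policy, then the clipped loss) can be sketched as follows. This is a minimal NumPy illustration with an assumed name and signature, not the repository's code; it works on log-probabilities, as most implementations do for numerical stability:

```python
import numpy as np

def ppo_actor_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO actor loss: ratio = P_new(s,a) / P_old(s,a), then the clipped
    surrogate objective, negated and averaged so it can be minimized."""
    ratio = np.exp(logp_new - logp_old)           # importance weight
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```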
