diff --git a/__pycache__/share_func.cpython-310.pyc b/__pycache__/share_func.cpython-310.pyc
new file mode 100644
index 0000000..815c6e2
Binary files /dev/null and b/__pycache__/share_func.cpython-310.pyc differ
diff --git a/dqn/dqn.md b/dqn/dqn.md
new file mode 100644
index 0000000..e69de29
diff --git a/dqn/dqn.py b/dqn/dqn.py
new file mode 100644
index 0000000..19ee897
--- /dev/null
+++ b/dqn/dqn.py
@@ -0,0 +1,9 @@
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+'''
+@File : dqn.py
+@Time : 2024/09/20 10:01:00
+@Author : junewluo
+@Email : overtheriver861@gmail.com
+@description : implementation of the DQN framework.
+'''
diff --git a/ppo/__pycache__/ppo.cpython-310.pyc b/ppo/__pycache__/ppo.cpython-310.pyc
new file mode 100644
index 0000000..d3d954b
Binary files /dev/null and b/ppo/__pycache__/ppo.cpython-310.pyc differ
diff --git a/ppo/__pycache__/relaybuffer.cpython-310.pyc b/ppo/__pycache__/relaybuffer.cpython-310.pyc
new file mode 100644
index 0000000..a44faf4
Binary files /dev/null and b/ppo/__pycache__/relaybuffer.cpython-310.pyc differ
diff --git a/ppo/__pycache__/trick.cpython-310.pyc b/ppo/__pycache__/trick.cpython-310.pyc
new file mode 100644
index 0000000..2fce3ab
Binary files /dev/null and b/ppo/__pycache__/trick.cpython-310.pyc differ
diff --git a/ppo/imgs/clip_func.png b/ppo/imgs/clip_func.png
new file mode 100644
index 0000000..52c3fc5
Binary files /dev/null and b/ppo/imgs/clip_func.png differ
diff --git a/ppo/imgs/clip_func_range.png b/ppo/imgs/clip_func_range.png
new file mode 100644
index 0000000..bde67d3
Binary files /dev/null and b/ppo/imgs/clip_func_range.png differ
diff --git a/ppo/ppo.md b/ppo/ppo.md
index bd67768..962a76e 100644
--- a/ppo/ppo.md
+++ b/ppo/ppo.md
@@ -3,7 +3,6 @@
If the agent being trained and the agent interacting with the environment (generating the training samples) are the same agent, the method is called on-policy (同策略).
If the agent being trained and the agent interacting with the environment (generating the training samples) are not the same agent, the method is called off-policy (异策略).
-https://blog.csdn.net/qq_33302004/article/details/115666895
## 2. 
Importance Sampling
Why do we need importance sampling? The reason can be seen directly from the gradient formula of the policy gradient (PG) algorithm:
$$
@@ -91,9 +90,50 @@ $$
- Update $\theta$ and $\phi$ to minimize the total loss $L_{total} = L_{surr}(\theta) - c_1 L_{vf}(\phi) + c_2 S[\pi_{\theta}](s_t) + ...$ (where $S$ is the entropy regularization term)
- Update $\theta_{old} \leftarrow \theta$
-## 3.3 PPO2
-
-
-
-
-
+### 3.3 PPO2
+PPO2 is also known as PPO-Clip. Instead of using the KL divergence as a constraint, it builds the constraint directly into the objective function, which is defined as:
+$$
+J^{\theta'}(\theta) = \sum_{(s_t,a_t)} \min\left( \frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)} A^{\theta'}(s_t,a_t),\ \mathrm{clip}\left(\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)},\ 1-\epsilon,\ 1+\epsilon\right) A^{\theta'}(s_t,a_t) \right)
+$$
+In the formula above, clip is a clamping function: it restricts the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$ to the interval $[1-\epsilon, 1+\epsilon]$; its behaviour is illustrated in Figure 1. The PPO2 objective tries to increase the expected cumulative return while keeping $p_\theta(a_t|s_t)$ and $p_{\theta'}(a_t|s_t)$ from drifting too far apart. Concretely, the idea behind the objective is the following:
+
+- $A$ is the advantage. $A>0$ means the action is better than average, so we want to increase $p_\theta(a_t|s_t)$; $A<0$ means the decision is poor, so we want to decrease $p_\theta(a_t|s_t)$. However, when $A>0$ and we try to increase $p_\theta(a_t|s_t)$, importance sampling constrains us: $p_\theta(a_t|s_t)$ must not move too far from $p_{\theta'}(a_t|s_t)$. So once the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$ grows large enough, it must be capped from above, otherwise the two distributions would differ too much. In PPO2 this cap is $1+\epsilon$: when the ratio exceeds this threshold, the objective gains nothing further from increasing $p_\theta(a_t|s_t)$. Likewise, when $A<0$ we want to decrease $p_\theta(a_t|s_t)$, but not excessively, so a lower bound $1-\epsilon$ is set: when the ratio falls below it, the objective gains nothing further from decreasing $p_\theta(a_t|s_t)$ either. In this way the expected return is maximized while the conditions required by importance sampling are still satisfied.
+
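As a concrete illustration, the clamping behaviour of clip can be sketched in a few lines of plain Python ($\epsilon = 0.2$ is an assumed, commonly used value; the text does not fix it):

```python
# Minimal sketch of the clip term in the PPO2 objective.
# eps = 0.2 is an assumed value, not one fixed by the text above.
def clip(ratio: float, eps: float = 0.2) -> float:
    """Clamp the importance ratio p_theta / p_theta' into [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

# Ratios inside [0.8, 1.2] pass through unchanged; outside they are pinned.
print([clip(r) for r in (0.5, 0.9, 1.1, 1.5)])  # [0.8, 0.9, 1.1, 1.2]
```

In a PyTorch implementation the same operation would typically be written as `torch.clamp(ratio, 1 - eps, 1 + eps)`.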
+ +
+
Figure 1 - Illustration of the clip function
+
+
+After understanding the clip function, we can analyze the PPO2 objective with the help of Figure 2, where for illustration the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$ is drawn as an increasing function. Observe that:
+
+- the green curve is the ratio $\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}$
+- the blue curve is the clip function
+- the red curve is the value actually selected by the $\min$ in the final PPO2 objective
+
+It is easy to see that we always take the smaller of the green and blue curves. If the advantage term $A$ multiplying them is greater than 0, taking the minimum yields the red curve in the left panel; likewise, if $A$ is less than 0, taking the minimum yields the red curve in the right panel.
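The case analysis above can be checked numerically. Below is a small plain-Python sketch of the per-sample objective $\min(\text{ratio} \cdot A,\ \text{clip}(\text{ratio}) \cdot A)$, again with $\epsilon = 0.2$ assumed:

```python
# Sketch of the per-sample PPO2 objective: min(ratio * A, clip(ratio) * A).
# eps = 0.2 is an assumed value, not one fixed by the text.
def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A > 0: the objective is capped at (1 + eps) * A, so pushing the ratio
# beyond 1 + eps yields no further gain (the flat part of the red curve).
print(clipped_objective(1.1, 1.0))   # 1.1 -> still in the unclipped region
print(clipped_objective(2.0, 1.0))   # 1.2 -> capped at (1 + eps) * A

# A < 0: below 1 - eps the min picks the clipped branch, which is constant
# in the ratio, so decreasing the ratio further yields no additional gain.
print(clipped_objective(0.5, -1.0))  # -0.8
print(clipped_objective(2.0, -1.0))  # -2.0 -> a large ratio on a bad action
                                     #         stays fully penalized
```

Note the asymmetry: for $A<0$ the unclipped branch is kept when the ratio is large, so bad actions whose probability grew too much are still penalized in full.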
+ +
+
Figure 2 - Optimization of the PPO2 objective
+
+
+
+## 4. Implementation of PPO2
+This section implements PPO2 with an Actor-Critic architecture. From the PPO2 objective in section 3, we can identify several key points for the implementation:
+
+- implementation of the advantage function
+-
\ No newline at end of file
diff --git a/ppo/ppo.py b/ppo/ppo.py
index 0ba13ae..5db7112 100644
--- a/ppo/ppo.py
+++ b/ppo/ppo.py
@@ -16,35 +16,6 @@ from torch.distributions import Categorical
 """ PPO Algorithm
 
-PPO算法是一种基于Actor-Critic的架构, 它的代码组成和A2C架构很类似。
-
-1、 传统的A2C结构如下:
-    ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ backward ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅
-    ⬇ ⬆
-    State & (R, new_State) ----> Critic ----> value -- 与真正的价值reality_value做差 --> td_e = reality_v - v
-    ⬇
-    State ----> Actor ---Policy--> P(s,a) ➡➡➡➡➡➡➡➡➡➡➡ 两者相加 ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅
-    ⬇ ⬇
-    ⬇ ⬇
-    ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ actor_loss = Log[P(s,a)] + td_e
-
-    ** Actor Section **:由Critic网络得到的输出td_e,和Actor网络本身输出的P(s,a)做Log后相加得到了Actor部分的损失,使用该损失进行反向传播
-    ** Critic Section **: Critic部分,网络接收当前状态State,以及env.step(action)返回的奖励(Reward,简写为R)和新的状态new_State
-
-2、 实际上PPO算法的Actor-Critic框架实现:
-    ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ backward ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅
-    ⬇ ⬆
-    State & (r, new_State) ➡➡➡ Critic ----> value -- 与真正的价值reality_value做差 --> td_e = reality_v - v
-    ⬇
-    State ➡➡➡ Actor[old_policy] ➡➡➡ P_old(s,a) ➡➡➡➡➡➡➡➡➡➡ ratio = P_new(s,a) / P_old(s,a) ➡➡ 依据式子(1)计算loss ➡➡➡➡ loss
-    ⬇ ⬆ ⬇
-    ⬇ ⬆[两者做商,得到重要性权值] ⬇
-    ⬇ ⬆ ⬇
-    ⬇ ➡➡➡➡ Actor[new_policy] ➡➡➡ P_new(s,a) ➡➡➡➡➡➡➡➡➡ ⬆ ⬇
-    ⬆ ⬇
-    ⬆ ⬇
-    ⬆⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅ backward ⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬅⬇
-同时将会实现PPO的一些技巧:
 """
diff --git a/runs/PPO_CartPole-v1_number_seed_1/events.out.tfevents.1726755309.DESKTOP-5L8V8DT.33612.0 b/runs/PPO_CartPole-v1_number_seed_1/events.out.tfevents.1726755309.DESKTOP-5L8V8DT.33612.0
deleted file mode 100644
index 5856927..0000000
Binary files a/runs/PPO_CartPole-v1_number_seed_1/events.out.tfevents.1726755309.DESKTOP-5L8V8DT.33612.0 and /dev/null differ
diff --git a/runs/PPO_CartPole-v1_number_seed_1/events.out.tfevents.1726801376.hcss-ecs-99d8.61776.0 b/runs/PPO_CartPole-v1_number_seed_1/events.out.tfevents.1726801376.hcss-ecs-99d8.61776.0
new file mode 100644
index 0000000..e1167f2
Binary files /dev/null and 
b/runs/PPO_CartPole-v1_number_seed_1/events.out.tfevents.1726801376.hcss-ecs-99d8.61776.0 differ
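Section 4 of `ppo.md` lists the advantage function as a key implementation point but leaves it open. One common choice (an assumption here, not something the text specifies) is Generalized Advantage Estimation (GAE); a minimal plain-Python sketch, with hypothetical `gamma`/`lam` defaults:

```python
# Hedged sketch of Generalized Advantage Estimation (GAE), one common way to
# implement the advantage A(s_t, a_t) mentioned in section 4 of ppo.md.
# gamma and lam defaults are assumed values, not taken from the text.
def gae_advantages(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Compute per-step advantages backwards through one trajectory.

    rewards, values, dones: per-step lists; next_value: V(s) after the last step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        mask = 1.0 - dones[t]          # stop bootstrapping at episode boundaries
        delta = rewards[t] + gamma * v_next * mask - values[t]  # TD error
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    return advantages

# With gamma = lam = 1 and zero value estimates, the advantage at each step
# reduces to the reward-to-go of the episode.
print(gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, [0.0, 1.0],
                     gamma=1.0, lam=1.0))  # [2.0, 1.0]
```

In the PPO training loop these advantages are typically normalized per batch before being multiplied by the probability ratio in the clipped objective.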