Yuexin Bian1 ·
Jie Feng1 ·
Tao Wang1 ·
Yijiang Li1 ·
Sicun Gao1 ·
Yuanyuan Shi1 ·
1University of California San Diego
[ICML 2026](https://icml.cc/virtual/2026/poster/61217)
We demonstrate that our approach significantly improves RL performance and accelerates convergence across standard locomotion and ManiSkill benchmarks, covering both state-based and vision-based tasks. Moreover, the method is algorithm-agnostic and consistently benefits multiple on-policy algorithms.
Simply replacing the standard Gaussian actor with our proposed actor substantially improves performance for continuous control, achieving state-of-the-art results within on-policy RL.
- 2026-05 — Paper accepted to ICML 2026 🎉
- 2026-05 — Code release.
git clone https://github.com/alwaysbyx/RND-RL.git
cd RND-RL
conda env create -f environment.yml
conda activate rl

ManiSkill environments require additional setup; see the ManiSkill installation guide.
Train PPO on a Gym MuJoCo task with a residual actor and discretized action head:
python train.py --config configs/gym.yaml --config_overrides "['env_id=HalfCheetah-v4']"

Train on a ManiSkill state-based task:
python train.py --config configs/maniskill-state.yaml --config_overrides "['env_id=PickCube-v1']"

Train on a ManiSkill RGB task:
python train.py --config configs/maniskill-rgb.yaml --config_overrides "['env_id=PickCube-v1']"

Train with TRPO:
python scripts/train_trpo/run_trpo_gym_dynamic.py

Train with PPO-CMA (baseline):
python train_ppocma.py --config configs/gym-ppocma.yaml

| Flag | Description |
|---|---|
| discrete_action | Discretize each action dimension into num_bins bins |
| num_bins | Number of bins per action dimension (default 41) |
| use_residual_blocks | Use Simba-style residual actor instead of MLP |
| actor_width / actor_depth | Actor hidden width / number of residual blocks |
| critic_width / critic_depth | Critic hidden width / depth |
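
To make the discrete_action and num_bins flags more concrete, here is a minimal sketch of per-dimension action discretization. It assumes uniformly spaced bin values over each dimension's [low, high] range with both endpoints included; the binning actually used in rnd.py may differ. The function names bin_values and decode_action are hypothetical and exist only for this illustration.

```python
from typing import List, Sequence


def bin_values(low: float, high: float, num_bins: int = 41) -> List[float]:
    """Return num_bins evenly spaced action values covering [low, high].

    Assumption: uniform spacing with both endpoints included.
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    span = high - low
    return [low + span * i / (num_bins - 1) for i in range(num_bins)]


def decode_action(
    indices: Sequence[int],
    lows: Sequence[float],
    highs: Sequence[float],
    num_bins: int = 41,
) -> List[float]:
    """Map one chosen bin index per action dimension to a continuous action."""
    if not (len(indices) == len(lows) == len(highs)):
        raise ValueError("indices, lows, and highs must have the same length")
    action = []
    for idx, low, high in zip(indices, lows, highs):
        if not 0 <= idx < num_bins:
            raise ValueError(f"bin index {idx} outside [0, {num_bins})")
        action.append(bin_values(low, high, num_bins)[idx])
    return action


if __name__ == "__main__":
    # A 3-dimensional action in [-1, 1]: lowest, middle, and highest bin.
    print(decode_action([0, 20, 40], [-1.0] * 3, [1.0] * 3))
```

Under this assumption, 41 bins over [-1, 1] place adjacent values 0.05 apart, and an odd bin count keeps an exact zero action available in the middle bin.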
We use Weights & Biases for experiment tracking. Before running, configure your own W&B project:
- Install and log in once:
  pip install wandb && wandb login
- Set your entity and (optionally) project name. You can do this in three ways:
  - Per-run via CLI override:
    python train.py --config configs/gym.yaml --config_overrides "['track=true','wandb_entity=<your-entity>','wandb_project_name=<your-project>']"
  - In a config file: edit the wandb_entity / wandb_project_name fields in configs/*.yaml.
  - In the launcher scripts: each script in scripts/ has a clearly marked
    "wandb_entity": None, # TODO: set to your wandb entity
    near the top. Set it before running paper sweeps. The dynamic schedulers (run_*_dynamic.py) additionally use the entity to skip already-finished runs; pass --no-wandb-check to bypass this.
To disable W&B entirely, run with track=false (the default) — TensorBoard logs are still written under runs/.
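
The --config_overrides argument in the commands above takes a quoted, list-style string of key=value entries. When launching many runs from a script, assembling that string by hand is error-prone. The helper below is a hypothetical sketch (not part of this repository) that reproduces only the format shown in the examples above; it assumes booleans are written in lowercase, as in track=true, and does not handle values that themselves contain quotes.

```python
import shlex
from typing import Dict, List, Union

Value = Union[str, int, float, bool]


def format_overrides(overrides: Dict[str, Value]) -> str:
    """Render overrides as "['key=value','key=value']"."""
    items = []
    for key, value in overrides.items():
        # Check bool first: bool is a subclass of int in Python.
        if isinstance(value, bool):
            value = "true" if value else "false"
        items.append(f"'{key}={value}'")
    return "[" + ",".join(items) + "]"


def train_command(config: str, overrides: Dict[str, Value]) -> List[str]:
    """Build an argv list for train.py, suitable for subprocess.run."""
    return [
        "python", "train.py",
        "--config", config,
        "--config_overrides", format_overrides(overrides),
    ]


if __name__ == "__main__":
    cmd = train_command(
        "configs/gym.yaml",
        {"env_id": "HalfCheetah-v4", "track": True, "wandb_entity": "<your-entity>"},
    )
    # shlex.join quotes the arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning an argv list rather than a single string avoids a second layer of shell quoting when the command is passed to subprocess.run.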
The launcher scripts in scripts/ sweep seeds, environments, and method variants used in the paper. They auto-distribute jobs across available GPUs.
# Gym (MuJoCo) ablation: discrete × residual variants
python scripts/run_gym_experiments.py
# ManiSkill state-based benchmark
python scripts/run_maniskill_state_experiments.py
# ManiSkill RGB benchmark
python scripts/run_maniskill_rgb_experiments.py
# Architecture ablation (residual vs MLP, discrete vs continuous)
python scripts/run_ablation_dynamic.py
# Baselines
python scripts/run_ppocma_gym_experiments.py
python scripts/run_ppocma_maniskill_state_experiments.py

Logs are written to runs/ and (optionally) Weights & Biases. Generate the paper figures with:
python scripts/make_demo_video.py # qualitative rollouts
# plotting utilities: see figures/

RND-RL/
├── train.py # Unified PPO trainer (Gym + ManiSkill state/RGB)
├── train_ppocma.py # PPO-CMA baseline trainer
├── rnd.py # Agent, residual actor, discrete-action head, NatureCNN
├── ppocma.py # PPO-CMA agent
├── eval.py # Evaluation loop
├── configs/ # Hydra/YAML configs (gym, maniskill-state, maniskill-rgb, ...)
├── envs/ # Env wrappers (Gym, ManiSkill state, ManiSkill RGB)
├── scripts/ # Multi-seed / multi-GPU launchers + TRPO trainers
├── figures/ # Plots used in the paper
└── environment.yml # Conda environment
- Gym / MuJoCo: HalfCheetah-v4, Hopper-v4, Walker2d-v4, Ant-v4, …
- ManiSkill (state): PickCube-v1, PushCube-v1, StackCube-v1, PushT-v1, PullCube-v1, PokeCube-v1, LiftPegUpright-v1, PickSingleYCB-v1, RollBall-v1, PickCubeSO100-v1
- ManiSkill (RGB): same task suite with visual observations
If you find this work useful, please cite our paper.
[MIT] — see LICENSE.
