Video action models are an appealing foundation for Vision–Language–Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies. We propose VAMPO, a post-training framework that directly improves visual dynamics in video action models through policy optimization. Our key idea is to formulate multi-step denoising as a sequential decision process and optimize the denoising policy with rewards defined over expert visual dynamics in latent space. To make this optimization practical, we introduce an Euler Hybrid sampler that injects stochasticity only at the first denoising step, enabling tractable low-variance policy-gradient estimation while preserving the coherence of the remaining denoising trajectory. We further combine this design with GRPO and a verifiable non-adversarial reward based on L1 distance and cosine similarity. Across diverse simulated and real-world manipulation tasks, VAMPO improves task-relevant visual dynamics, leading to better downstream action generation and stronger generalization.
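The verifiable reward and GRPO-style group-relative advantage described above can be sketched as follows. This is a minimal illustration, not the exact training code: it assumes latents are flat arrays, that the reward mixes negative L1 distance with cosine similarity via an assumed weight `alpha`, and that advantages are standardized within a group of rollouts for the same condition.

```python
import numpy as np

def dynamics_reward(pred_latent, expert_latent, alpha=0.5):
    """Verifiable non-adversarial reward over expert visual dynamics:
    negative L1 distance plus cosine similarity in latent space
    (the alpha weighting and normalization here are assumptions)."""
    p, e = pred_latent.ravel(), expert_latent.ravel()
    l1 = np.mean(np.abs(p - e))  # lower is better, so it is negated below
    cos = float(p @ e / (np.linalg.norm(p) * np.linalg.norm(e) + 1e-8))
    return alpha * (-l1) + (1.0 - alpha) * cos

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: standardize rewards
    within a group of sampled denoising rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In this sketch, a group of rollouts (differing only in the stochastic first denoising step of the Euler Hybrid sampler) would each be scored with `dynamics_reward`, and `grpo_advantages` would convert those scores into zero-mean advantages for the policy-gradient update.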
- Inference and evaluation code on Calvin
- Reinforcement learning post-training code
conda create -n VAMPO python=3.10 -y
conda activate VAMPO
# Install calvin as described in (https://github.com/mees/calvin).
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
sh install.sh
# Install VAMPO requirements
cd ..
pip install -r requirements.txt

| Ckpt name | Training type | Size |
|---|---|---|
| VAMPO_svd | SVD video model trained with our method | ~8G |
| VAMPO_policy | Action model trained on the annotated CALVIN ABC dataset | ~1G |
| clip-vit-base-patch32 | CLIP text encoder | ~600M |
First, follow the instructions in the official CALVIN repo to install the CALVIN environments and download the official CALVIN ABC-D dataset (about 500 GB).
Next, download the VAMPO_svd video model and the VAMPO_policy action model. In the evaluation script, set `video_model_folder` and `action_model_folder` to the folders where you saved the models.
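As a concrete sketch, the script variables might be set like this (the paths below are hypothetical; only the variable names `video_model_folder` and `action_model_folder` come from the script — adjust everything to wherever you downloaded the checkpoints):

```shell
# Hypothetical checkpoint layout -- adjust paths to your own setup.
video_model_folder=$HOME/checkpoints/VAMPO_svd
action_model_folder=$HOME/checkpoints/VAMPO_policy
```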
bash scripts/eval_calvin.sh

Dyn-VPP is developed from Video Prediction Policy. We thank the authors for their efforts!
