Video action models are an appealing foundation for Vision–Language–Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies. We propose VAMPO, a post-training framework that directly improves visual dynamics in video action models through policy optimization. Our key idea is to formulate multi-step denoising as a sequential decision process and optimize the denoising policy with rewards defined over expert visual dynamics in latent space. To make this optimization practical, we introduce an Euler Hybrid sampler that injects stochasticity only at the first denoising step, enabling tractable low-variance policy-gradient estimation while preserving the coherence of the remaining denoising trajectory. We further combine this design with GRPO and a verifiable non-adversarial reward based on L1 distance and cosine similarity. Across diverse simulated and real-world manipulation tasks, VAMPO improves task-relevant visual dynamics, leading to better downstream action generation and stronger generalization.
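The verifiable reward and GRPO-style group-relative advantage described above can be sketched as follows. This is a minimal illustration, not the exact training code: it assumes latents are flat arrays, that the reward mixes negative L1 distance with cosine similarity via an assumed weight `alpha`, and that advantages are standardized within a group of rollouts for the same condition.

```python
import numpy as np

def dynamics_reward(pred_latent, expert_latent, alpha=0.5):
    """Verifiable non-adversarial reward over expert visual dynamics:
    negative L1 distance plus cosine similarity in latent space
    (the alpha weighting and normalization here are assumptions)."""
    p, e = pred_latent.ravel(), expert_latent.ravel()
    l1 = np.mean(np.abs(p - e))  # lower is better, so it is negated below
    cos = float(p @ e / (np.linalg.norm(p) * np.linalg.norm(e) + 1e-8))
    return alpha * (-l1) + (1.0 - alpha) * cos

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: standardize rewards
    within a group of sampled denoising rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In this sketch, a group of rollouts (differing only in the stochastic first denoising step of the Euler Hybrid sampler) would each be scored with `dynamics_reward`, and `grpo_advantages` would convert those scores into zero-mean advantages for the policy-gradient update.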
- Inference and evaluation code on Calvin
- Reinforcement learning post-training code
conda create -n VAMPO python=3.10 -y
conda activate VAMPO
# Install calvin as described in (https://github.com/mees/calvin).
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
sh install.sh
# Install VAMPO requirements
cd ..
pip install -r requirements.txt

| Ckpt name | Training type | Size |
|---|---|---|
| VAMPO_svd | SVD video model trained with our method | ~8G |
| VAMPO_policy | Action model trained on the annotated CALVIN ABC dataset | ~1G |
| clip-vit-base-patch32 | CLIP text encoder | ~600M |
First, follow the instructions in the official CALVIN repo to install the CALVIN environments and download the official CALVIN ABC-D dataset (about 500 GB).
Next, download the VAMPO_svd video model and the VAMPO_policy action model. In the evaluation script, set `video_model_folder` and `action_model_folder` to the folders where you saved the models.
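As a concrete sketch, the script variables might be set like this (the paths below are hypothetical; only the variable names `video_model_folder` and `action_model_folder` come from the script — adjust everything to wherever you downloaded the checkpoints):

```shell
# Hypothetical checkpoint layout -- adjust paths to your own setup.
video_model_folder=$HOME/checkpoints/VAMPO_svd
action_model_folder=$HOME/checkpoints/VAMPO_policy
```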
bash scripts/eval_calvin.sh

Dyn-VPP is developed from Video Prediction Policy. We thank the authors for their efforts!
