- Install MemoryVLA following original steps
# install torch
micromamba create -n memvla python=3.10
micromamba activate memvla
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
micromamba install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y
# install flash-attn
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# install MemoryVLA
cd MemoryVLA
pip install -e .
- Install RoboMME as in MME-VLA-Suite. You can just reuse the same environment.
Option 1: Generate the RLDS data format following rlds_dataset_builder. We provide our script in MemoryVLA/rlds_dataset_builder/robomme for reference.
Option 2: Download our processed RLDS data from here directly.
The provided dataset contains full trajectories (demo + execution frames), but does not contain the is_video_demo label that differentiates the frame type. If you need this label, regenerate the data via Option 1.
bash script/train/robomme/train.sh
We provide our trained MemoryVLA ckpt here
# Terminal 0
micromamba activate memvla
bash script/eval/robomme/server.sh
# Terminal 1
# After the server is already running, then run
micromamba activate robomme
bash script/eval/robomme/client.sh
Currently, we keep all hyperparameters the same as in the LIBERO training and testing setups, changing only the batch size to 64 and the total training steps to 160k. These settings may not be optimal.
Success rate per task and seed:

| Suite | Task | Seed 7 | Seed 42 | Seed 0 | Avg |
|---|---|---|---|---|---|
| Counting | BinFill | 0.12 | 0.10 | 0.08 | 0.10 |
| | PickXtimes | 0.06 | 0.24 | 0.22 | 0.17 |
| | SwingXtimes | 0.00 | 0.02 | 0.02 | 0.01 |
| | StopCube | 0.00 | 0.00 | 0.00 | 0.00 |
| Permanence | VideoUnmask | 0.10 | 0.18 | 0.20 | 0.16 |
| | VideoUnmaskSwap | 0.04 | 0.04 | 0.06 | 0.05 |
| | ButtonUnmask | 0.04 | 0.06 | 0.14 | 0.08 |
| | ButtonUnmaskSwap | 0.04 | 0.04 | 0.06 | 0.05 |
| Reference | PickHighlight | 0.04 | 0.10 | 0.08 | 0.07 |
| | VideoRepick | 0.00 | 0.00 | 0.00 | 0.00 |
| | VideoPlaceButton | 0.12 | 0.16 | 0.12 | 0.13 |
| | VideoPlaceOrder | 0.08 | 0.00 | 0.04 | 0.04 |
| Imitation | MoveCube | 0.12 | 0.18 | 0.12 | 0.14 |
| | InsertPeg | 0.00 | 0.00 | 0.00 | 0.00 |
| | PatternLock | 0.12 | 0.10 | 0.14 | 0.12 |
| | RouteStick | 0.02 | 0.00 | 0.02 | 0.01 |
| Overall | | 0.0563 | 0.0763 | 0.0813 | 0.0713 |
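
As a quick sanity check on the table above, the Overall row can be reproduced as the unweighted mean over the 16 tasks for each seed. The sketch below copies the per-task values from the table; the variable and function names are illustrative only and are not part of the repository.

```python
# Per-task success rates copied from the table, ordered as (seed 7, seed 42, seed 0).
RESULTS = {
    "BinFill": (0.12, 0.10, 0.08),
    "PickXtimes": (0.06, 0.24, 0.22),
    "SwingXtimes": (0.00, 0.02, 0.02),
    "StopCube": (0.00, 0.00, 0.00),
    "VideoUnmask": (0.10, 0.18, 0.20),
    "VideoUnmaskSwap": (0.04, 0.04, 0.06),
    "ButtonUnmask": (0.04, 0.06, 0.14),
    "ButtonUnmaskSwap": (0.04, 0.04, 0.06),
    "PickHighlight": (0.04, 0.10, 0.08),
    "VideoRepick": (0.00, 0.00, 0.00),
    "VideoPlaceButton": (0.12, 0.16, 0.12),
    "VideoPlaceOrder": (0.08, 0.00, 0.04),
    "MoveCube": (0.12, 0.18, 0.12),
    "InsertPeg": (0.00, 0.00, 0.00),
    "PatternLock": (0.12, 0.10, 0.14),
    "RouteStick": (0.02, 0.00, 0.02),
}


def overall_per_seed(results):
    """Average the success rate over all tasks, separately for each seed."""
    n_seeds = len(next(iter(results.values())))
    return [
        sum(rates[i] for rates in results.values()) / len(results)
        for i in range(n_seeds)
    ]


per_seed = overall_per_seed(RESULTS)       # one mean per seed
overall = sum(per_seed) / len(per_seed)    # mean across the three seeds

if __name__ == "__main__":
    print([f"{value:.5f}" for value in per_seed], f"{overall:.5f}")
```

Averaging per seed first and then across seeds gives the same number as averaging all 48 entries, because every seed covers the same 16 tasks.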
Note: This is only a basic adaptation of MemoryVLA to RoboMME. If you obtain better results with MemoryVLA, do not hesitate to submit your models following this guideline.
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun
ICLR 2026
This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".
Project Page | Paper | 🤗 Models & Logs
- 🔥 [2026-1-27] Our paper MemoryVLA has been accepted to ICLR 2026!
- 🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
- 🔥 [2025-10-20] Our VLA codebase Dexbotic is released, and it now fully integrates MemoryVLA!
- 🔥 [2025-8-26] Our paper MemoryVLA is now on arXiv!
MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.
We release two versions of the code in separate branches:
- MemoryVLA: built upon the OpenVLA codebase.
- MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.
All components are now available, and we will continue to refine and improve the code.
- Code Release
  - MemoryVLA (OpenVLA codebase)
  - MemoryVLA+ (Dexbotic codebase)
- Model Weights Release
- Dataset Upload to HuggingFace
This branch is MemoryVLA, built on the OpenVLA codebase. If you need the Dexbotic codebase, please use MemoryVLA+.
- Model Zoo & Benchmark Results
- Install
- Training
- Evaluation in SimplerEnv
- Evaluation in LIBERO
- Deployment in The Real World
- FAQ
- Citation
All datasets use only third-person RGB and language, without wrist-view images or state. MemoryVLA denotes the OpenVLA-codebase version, and MemoryVLA+ denotes the Dexbotic-codebase version.
| Model | Spoon | Carrot | Cube | Eggplant | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|
| MemoryVLA | 75.0 | 75.0 | 37.5 | 100.0 | 71.9 | 🤗 HF |
| MemoryVLA+ | 100.0 | 66.7 | 70.8 | 100.0 | 84.4 | 🤗 HF |
| Model | Spatial | Object | Goal | Long-10 | Long-90 | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|---|
| MemoryVLA | 98.4 | 98.4 | 96.4 | 93.4 | 95.6 | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ | 98.2 | 97.8 | 96.4 | 93.6 | 96.2 | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ (mix) | 97.2 | 99.2 | 98.4 | 93.2 | 97.2 | 97.1 | 🤗 HF |
| Model | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|
| MemoryVLA | 90.7 | 88.0 | 84.7 | 47.2 | 77.7 | 🤗 HF |
| MemoryVLA+ | 92.0 | 91.7 | 71.8 | - | - | 🤗 HF |
| Model | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|
| MemoryVLA | 80.5 | 78.8 | 53.2 | 58.3 | 67.7 | 🤗 HF |
| MemoryVLA+ | 83.5 | 81.8 | 63.2 | - | - | 🤗 HF |
| Model | Pick Cube | Stack Cube | Pick Single YCB | Pick Single EGAD | Pick Clutter YCB | Avg. | CKPT & Logs |
|---|---|---|---|---|---|---|---|
| MemoryVLA+ | 85 | 75 | 60 | 85 | 45 | 70 | 🤗 HF |
The code is built using Python 3.10, and we use PyTorch == 2.2.0 and CUDA == 12.1 (It may run with lower versions, but we have not tested it).
We recommend using Miniconda and setting up an environment:
conda create --name memvla python=3.10
conda activate memvla
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y
If you need to use the training code, please also install Flash Attention. We use flash-attn==2.5.5:
# Install Flash Attention 2.5.5, this is an example for pytorch2.2-cuda12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Next, clone our repo and install the required packages:
git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .
If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error "Floating point exception (core dumped)", try reinstalling the specific cuBLAS version below:
# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8
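
After installation, a short check can confirm that the environment matches the versions named above. This is a minimal sketch using only the standard library; the `EXPECTED` mapping and function names are illustrative and not part of the repository, and flash-attn is only needed for training, so a missing entry there may be acceptable.

```python
from importlib import metadata

# Versions stated in the install instructions above.
EXPECTED = {
    "torch": "2.2.0",
    "torchvision": "0.17.0",
    "torchaudio": "2.2.0",
    "flash-attn": "2.5.5",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package to (found_version, matches_expected)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Wheels from the cu121 index may report versions such as "2.2.0+cu121",
        # so compare only the part before any local "+..." suffix.
        ok = found is not None and found.split("+")[0] == wanted
        report[package] = (found, ok)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: found={found} expected={EXPECTED[package]} {status}")
```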
- Prepare the training datasets in RLDS format:
- LIBERO (including Spatial, Object, Goal, Long-10, Long-90 suites)
- Bridge from Open X-Embodiment (OXE)
- Fractal from Open X-Embodiment (OXE)
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
# Download the LIBERO dataset (processed, ~22 GB)
git clone https://huggingface.co/datasets/shihao1895/libero-rlds
# Download the Bridge dataset (processed, ~157 GB)
git clone https://huggingface.co/datasets/shihao1895/bridge-rlds
# Download the Fractal dataset (processed)
git clone https://huggingface.co/datasets/shihao1895/fractal-rlds
- Download the pretrained models. We use the OpenVLA Pretrained Model for LIBERO training, and the CogACT Pretrained Model for Bridge and Fractal training.
# Download OpenVLA pretrained checkpoint (~30 GB)
git clone https://huggingface.co/openvla/openvla-7b-prismatic
# Download CogACT pretrained checkpoint (~31 GB)
git clone https://huggingface.co/CogACT/CogACT-Large
- Train the model on different datasets.
Before training, modify several parameters in the corresponding scripts, such as hf_token, wandb_entity, checkpoint paths, dataset paths, and log directories.
We train on a single node with 8× NVIDIA A100 GPUs.
# Train on the Bridge dataset
bash script/train/bridge/train_bridge.sh
# Train on the LIBERO-Spatial dataset
bash script/train/libero/train_libero_spatial.sh
# Train on the LIBERO-Object dataset
bash script/train/libero/train_libero_object.sh
# Train on the LIBERO-Goal dataset
bash script/train/libero/train_libero_goal.sh
# Train on the LIBERO-100 dataset
bash script/train/libero/train_libero_100.sh
# Train on the Fractal dataset
bash script/train/fractal/train_fractal.sh
# Train on real-world data
bash script/train/real_world/train_real.sh
To fine-tune on your own data, follow the instructions in rlds_dataset_builder to convert your data to RLDS format. Actions should be 7-dimensional end-effector (EEF) deltas: EEF delta XYZ (3) + roll-pitch-yaw (3) + gripper open/close (1). Once your data is ready, place it directly under the <data_root_dir>/custom_finetuning/1.0.0 directory, then set vla.data_mix="custom_finetuning".
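
To make the 7-dimensional action layout concrete, the sketch below builds one action from two consecutive end-effector poses. It is an illustration under stated assumptions, not the repository's conversion code: the function names are hypothetical, poses are assumed to be (x, y, z, roll, pitch, yaw) with angles in radians, rotation deltas are taken by component-wise subtraction with wrap-around (an approximation that is reasonable for small steps), and the gripper value convention depends on your robot.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def make_delta_action(prev_pose, next_pose, gripper):
    """Build a 7-D action: delta XYZ (3) + delta roll-pitch-yaw (3) + gripper (1).

    prev_pose, next_pose: (x, y, z, roll, pitch, yaw) of the end effector.
    gripper: open/close command for this step (convention is an assumption).
    """
    if len(prev_pose) != 6 or len(next_pose) != 6:
        raise ValueError("poses must have 6 entries: x, y, z, roll, pitch, yaw")
    delta_xyz = [n - p for p, n in zip(prev_pose[:3], next_pose[:3])]
    # Wrap so that crossing the +/- pi boundary yields a small delta.
    delta_rpy = [wrap_angle(n - p) for p, n in zip(prev_pose[3:], next_pose[3:])]
    return delta_xyz + delta_rpy + [float(gripper)]


if __name__ == "__main__":
    action = make_delta_action(
        (0.30, 0.00, 0.20, 0.0, 0.0, 3.10),
        (0.31, 0.00, 0.20, 0.0, 0.0, -3.10),
        gripper=1,
    )
    print(len(action), action)
```

The wrap-around matters when yaw crosses the ±π boundary: without it, a small physical rotation would appear as a delta of almost 2π.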
We provide evaluation interfaces and scripts based on SimplerEnv.
- Please follow the installation guide in the SimplerEnv Repo to set up the simulation environment, and make sure to place the repo under:
./third_libs/SimplerEnv
- Evaluation example:
# Run evaluation
bash script/eval/bridge/eval_bridge.sh
# Summarize results
python script/eval/bridge/extract_bridge_results.py
NOTE: Due to the instability of the SimplerEnv benchmark and diffusion process, the performance scores across different iterations can vary significantly. Please evaluate multiple checkpoints and report the best result.
We also provide evaluation interfaces and scripts based on LIBERO.
- Please follow the installation guide in the LIBERO Repo to set up the simulation environment, and make sure to place the repo under:
./third_libs/LIBERO
- Evaluation example:
# Run evaluation
bash script/eval/libero/eval_libero.sh
# Summarize results
python script/eval/libero/extract_libero_results.py
NOTE: The evaluation mechanism here differs from SimplerEnv. The process first loads the model using develop.py, then waits for a period before running evaluation/libero/eval_libero.py for testing. In addition, since performance may vary across iterations, please evaluate multiple checkpoints and report the best result.
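
If the fixed waiting period proves too short or too long on your machine, one alternative is to poll until the model process accepts connections. The sketch below assumes the model loaded by develop.py is exposed on a TCP port; that assumption, the host, the port number, and the function name are all hypothetical and should be checked against the scripts.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds or timeout expires.

    Returns True once the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); retry after a short pause.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Hypothetical address; replace with the one your server script uses.
    ready = wait_for_port("127.0.0.1", 5500, timeout=5.0)
    print("server ready" if ready else "server not reachable")
```

A successful connection only shows that the port is open, not that the model has finished loading, so a short extra delay or a test request may still be needed.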
To deploy the model on your own robot, first collect corresponding real-world manipulation data (e.g., via teleoperation), and use it to fine-tune the pretrained model.
Next, set up the server and client as shown in deploy.py, and deploy the system on your real robot.
The following command launches the server:
bash script/eval/real_world/deploy.sh
The robot acts as the client, and for each request it must send the following three items to obtain the action chunking result. The field episode_first_frame is a string ('True' or 'False') indicating whether the current frame is the first frame of the episode.
image = request.files['image']
query = request.form['text']
episode_first_frame = request.form['episode_first_frame']
This deployment process follows a similar design to OpenVLA and CogACT.
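
On the robot side, a client could send the three fields as a multipart form POST. The following is a minimal sketch, not the repository's client: the URL, the file name, the PNG encoding, and the function names are assumptions, it relies on the third-party requests package, and the response format is not specified here, so the raw response is returned unparsed.

```python
def build_request_parts(image_bytes, instruction, is_first_frame):
    """Prepare the multipart parts matching the three fields the server reads."""
    # "frame.png" and "image/png" are illustrative; use your camera's encoding.
    files = {"image": ("frame.png", image_bytes, "image/png")}
    data = {
        "text": instruction,
        # The server expects the string 'True' or 'False', not a boolean.
        "episode_first_frame": "True" if is_first_frame else "False",
    }
    return files, data


def query_server(url, image_bytes, instruction, is_first_frame, timeout=10.0):
    """POST one observation to the deployment server and return the response."""
    import requests  # third-party: pip install requests

    files, data = build_request_parts(image_bytes, instruction, is_first_frame)
    response = requests.post(url, files=files, data=data, timeout=timeout)
    response.raise_for_status()
    return response


if __name__ == "__main__":
    files, data = build_request_parts(b"<png bytes>", "pick up the cube", True)
    print(sorted(files), data)
```

Set is_first_frame to True only for the first request of each episode, since the server uses this flag to identify the first frame of an episode.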
SimplerEnv and ManiSkill may involve several dependency issues during installation. Below are some common troubleshooting tips based on our experience.
(1) Vulkan / SAPIEN issues
Example errors:
ImportError: libvulkan.so.1: cannot open shared object file: No such file or directory
Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.
Fix:
sudo apt install -y libegl1-mesa libgl1-mesa-dev libgles2-mesa-dev
See also: https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting
Note: Check that the .json files correctly link to the .so file corresponding to your current NVIDIA driver version. Use nvidia-smi to check your driver version and locate the correct .so under /usr/lib/x86_64-linux-gnu/.
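
To see which .so each Vulkan .json file points to, the manifests can be listed with a short script. This sketch assumes the files are Vulkan ICD manifests with an ICD.library_path entry and that they live in the common Linux locations shown; both the directory list and the function name are assumptions to adjust for your system.

```python
import json
from pathlib import Path

# Common ICD manifest locations on Linux (an assumption; adjust as needed).
ICD_DIRS = ["/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"]


def read_icd_library_paths(directories=ICD_DIRS):
    """Map each ICD manifest file to the library_path it declares."""
    found = {}
    for directory in directories:
        base = Path(directory)
        if not base.is_dir():
            continue
        for manifest in sorted(base.glob("*.json")):
            try:
                content = json.loads(manifest.read_text())
            except (OSError, json.JSONDecodeError):
                continue  # unreadable or malformed manifest
            if not isinstance(content, dict):
                continue
            icd = content.get("ICD")
            if isinstance(icd, dict) and "library_path" in icd:
                found[str(manifest)] = icd["library_path"]
    return found


if __name__ == "__main__":
    for manifest, library in read_icd_library_paths().items():
        print(f"{manifest} -> {library}")
```

The printed library_path may be a bare file name resolved by the dynamic loader rather than an absolute path, so compare it against the files under /usr/lib/x86_64-linux-gnu/ by name.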
(2) OpenGL issues
Example errors:
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Fix:
sudo apt install -y libgl1 libglib2.0-0 libglx-mesa0 libopengl0 libglu1-mesa mesa-utils
(3) Video recording in SimplerEnv
sudo apt install -y ffmpeg
(4) Benchmark Score Fluctuations
Benchmark scores tend to fluctuate, so we recommend evaluating checkpoints at regular iteration intervals and reporting the best result. Moreover, we have observed that even slight differences in Conda package versions may lead to small variations in the scores.
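
Selecting the best checkpoint from a sweep can be sketched as follows. The function name is hypothetical and the scores in the usage example are made-up placeholders, not measured results; in practice the values would come from the result-extraction scripts.

```python
def best_checkpoint(scores):
    """Return (step, success_rate) for the highest-scoring checkpoint.

    scores: mapping from training step to success rate.
    Ties are resolved in favor of the earliest step.
    """
    if not scores:
        raise ValueError("no checkpoint scores given")
    # Sorting by step first makes max() return the earliest step among ties.
    return max(sorted(scores.items()), key=lambda item: item[1])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sweep = {20000: 0.60, 30000: 0.70, 40000: 0.70}
    print(best_checkpoint(sweep))
```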
If you find our work helpful in your research, please consider citing our paper.
@article{shi2025memoryvla,
title={MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation},
author={Shi, Hao and Xie, Bin and Liu, Yingfei and Sun, Lin and Liu, Fengrong and Wang, Tiancai and Zhou, Erjin and Fan, Haoqiang and Zhang, Xiangyu and Huang, Gao},
journal={arXiv preprint arXiv:2508.19236},
year={2025}
}
@article{dexbotic,
title={Dexbotic: Open-Source Vision-Language-Action Toolbox},
author={Dexbotic Contributors},
journal={arXiv preprint arXiv:2510.23511},
year={2025}
}