
Coarse-to-fine Language-Aligned manipulation Policy (CLAP)

To enhance generalization to novel instructions and environment variations, we propose Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation.

🔗 Website 📄 arXiv

Getting Started

Install

  • Tested (Recommended) Versions: Python 3.10 and CUDA 12.1.

  • Step 1 (Optional): We recommend using conda and creating a virtual environment.

conda create --name clap python=3.10
conda activate clap
  • Step 2: Install PyTorch. Make sure the PyTorch version is compatible with your CUDA version. More instructions for installing PyTorch can be found here.
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Check that CUDA is available with the installed torch before moving to the next step.
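A minimal sanity check for this step (assumes the torch from Step 2 is installed in the active environment; prints a hint instead of crashing if it is missing):

```python
# Quick sanity check: confirm torch imports and sees a CUDA device.
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch is not installed -- rerun Step 2")
else:
    import torch
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
```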

  • Step 3: Install PyTorch3D. For more instructions visit here.
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"

  • Step 4: Download CoppeliaSim. Once you have downloaded CoppeliaSim, add the following to your ~/.bashrc file. (NOTE: edit the 'EDIT ME' in the first line.)

export COPPELIASIM_ROOT=<EDIT ME>/PATH/TO/COPPELIASIM/INSTALL/DIR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT
export DISPLAY=:1.0

For a headless server, replace the last line above with:

Xvfb :0 -screen 0 1024x768x24 +extension GLX +render -noreset & export DISPLAY=:0

Remember to source your .bashrc (source ~/.bashrc) or .zshrc (source ~/.zshrc) after this.
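To verify that the variables took effect in a new shell, a small check can help (this helper is illustrative, not part of the repo):

```python
# Report whether each CoppeliaSim-related variable is visible to child processes.
import os

REQUIRED_VARS = (
    "COPPELIASIM_ROOT",
    "LD_LIBRARY_PATH",
    "QT_QPA_PLATFORM_PLUGIN_PATH",
    "DISPLAY",
)

def check_env(env=os.environ):
    """Return a dict mapping each required variable to its value, or None if unset."""
    return {var: env.get(var) for var in REQUIRED_VARS}

for var, value in check_env().items():
    print(f"{var} = {value if value is not None else '<not set>'}")
```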

  • Step 5: Clone the repository with the submodules using the following command.
git clone --recurse-submodules https://github.com/Jianshu-Hu/CLAP.git && cd CLAP && git submodule update --init
  • Step 6: Install the packages for fine-tuning the VLM with ms-swift:
pip install ms-swift==3.5.2
pip install transformers==4.51.3
pip install modelscope==1.27.1
pip install peft==0.15.2
pip install trl==0.18
pip install deepspeed==0.16.9
pip install vllm==0.8.5.post1
pip install qwen_vl_utils
  • Step 7: Install required libraries such as PyRep, RLBench, YARR, Point Renderer, and robot-colosseum.
pip install -e libs/PyRep 
pip install -e libs/RLBench 
pip install -e libs/YARR 
pip install -e libs/point-renderer
pip install -e libs/robot-colosseum-rvt/
pip install transforms3d
pip install timm
pip install bitsandbytes
pip install openai-clip
pip install pyquaternion
  • Step 8: Collect dataset.

    • You can generate the initial demonstrations using the following command. They will be generated under Generalizable-CLAP/data/gembench/xxx, where xxx is train, test, or val. Then modify DATA_DIR in config.py to match this location.
    bash scripts/collect_gembench_data.sh
    
    • Additionally, we use the same dataloader as PerAct, which is based on YARR. It saves the replay buffer to disk (created only once, the first time you run the low-level training). You can modify TASK_REPLAY_STORAGE_FOLDER in config.py to choose where the replay buffer is saved.
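For concreteness, the two config.py entries mentioned in Step 8 might look like this (the paths below are placeholders, not the repository's defaults):

```python
# config.py -- illustrative values only; substitute your own absolute paths.

# Where the generated GemBench demonstrations live (see Step 8).
DATA_DIR = "/path/to/Generalizable-CLAP/data/gembench"

# Where YARR stores the replay buffer created on the first low-level training run.
TASK_REPLAY_STORAGE_FOLDER = "/path/to/replay_buffer"
```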
  • Additional notes:

    • For a headless server, if you face a Qt-related issue such as Could not find the Qt platform plugin "xcb", try
    pip uninstall opencv-python opencv-python-headless
    pip install opencv-python-headless
    
    • If you face a libGL-related issue such as miniconda3/envs/robot-vlm/bin/../lib/libstdc++.so.6: version 'GLIBCXX_3.4.30' not found, run the following command to check whether your system already provides GLIBCXX_3.4.30.
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
    

    If GLIBCXX_3.4.30 appears in the output, back up the original libstdc++.so.6 and copy your system's libstdc++.so.6 into the conda env. Remember to replace the directory with your own path.

    mv miniconda3/envs/robot-vlm/lib/libstdc++.so.6 miniconda3/envs/robot-vlm/lib/libstdc++.so.6.old
    cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 miniconda3/envs/robot-vlm/lib/libstdc++.so.6
    

    If GLIBCXX_3.4.30 does not appear, update the libstdc++ library.

Train and eval in GemBench

Train

  • Coarse Task Planner:
    • Step 1: Prepare training data.
      bash scripts/prepare_gembench_pretraining_data.sh
      
      See detailed instructions for more information.
    • Step 2: Train the high-level module for GemBench. We provide a script for multi-GPU training:
      bash scripts/sft_gembench.sh
      
      and Python code for single-GPU training:
      python train.py --tag coarse_task_planner --task_name gembench --num_episodes 10 --data_type lang_keypoints --cot 9 --epochs 1 --lr 0.0003 --eval_save_steps 250 --include_lang_plan gembench
      
  • Fine-grained action predictor:
    • Step 1: Train the low-level policy for GemBench. Note that you need to set --gradient_accumulation to 16/num_gpus, where num_gpus is the number of GPUs you use (e.g. 16 for one GPU, 8 for two). For example, with one GPU, run:
      python finegrained_policy/train.py  --gradient_accumulation 16 --with_val --epochs 20 --tasks gembench --tag fine_grained_policy
      

Eval

Evaluate on GemBench:

bash scripts/eval_gembench.sh

Note: See detailed instructions for more information.

About

Code repo for ICLR 2026 paper "Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints"
