Training speed mismatch #35

@dragonlzm

Description

Hello, thanks for the great work! I am trying to use the following command (almost identical to the one provided in the repo) to reproduce the training on LIBERO with 4×H100 GPUs.

It turns out training will take about 28 hours, which does not match the ~5 hours reported in the repo.

[screenshot: training progress / ETA]

Also, the GPUs are not fully utilized. Is this normal?

[screenshot: GPU utilization]

Do you have any thoughts on why the training time is so much longer? Thanks, and looking forward to hearing from you!

data_name=libero_spatial_no_noops

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune.py \
--vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
--config_file_path pretrained_models/configs \
--data_root_dir data/libero \
--dataset_name $data_name \
--run_root_dir outputs \
--use_film False \
--num_images_in_input 2 \
--use_proprio True \
--use_lora True \
--use_fz False \
--use_minivlm True \
--image_aug True \
--num_steps_before_decay 150000 \
--max_steps 150005 \
--save_freq 5000 \
--save_latest_checkpoint_only False \
--merge_lora_during_training True \
--batch_size 16 \
--grad_accumulation_steps 1 \
--learning_rate 2e-4 \
--lora_rank 64 \
--use_pro_version True \
--wandb_project "$data_name" \
--run_id_note VLA-Adapter--spatial--testing
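For reference, a back-of-the-envelope check of the implied per-step throughput, using the numbers from this issue (150005 total steps from `--max_steps`, ~28 h observed vs ~5 h reported). The gap works out to roughly a 5.6× slowdown per optimizer step, which, combined with the low GPU utilization, often suggests an input-pipeline or I/O bottleneck rather than compute.

```python
# Implied average time per optimizer step, given a total wall-clock
# estimate and a step count. Numbers below come from this issue report.
def seconds_per_step(total_hours: float, total_steps: int) -> float:
    """Convert a total-run wall-clock estimate into seconds per step."""
    return total_hours * 3600 / total_steps

observed = seconds_per_step(28, 150005)  # ETA shown in the screenshot
reported = seconds_per_step(5, 150005)   # time claimed in the repo
print(f"observed: {observed:.2f} s/step, "
      f"reported: {reported:.2f} s/step, "
      f"slowdown: {observed / reported:.1f}x")
```

This is only a rough sketch; it assumes the ETA covers all 150005 steps at a constant rate.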
