Hi ReconVLA Team,
Thank you for open-sourcing this impressive work! The reconstructive paradigm for VLA models is a very creative approach to tackling fine-grained visual attention in robotics.
In Step 2 (Generate target_image), the documentation mentions using object detection and grounding methods like Grounding DINO to extract gaze regions for datasets like LIBERO and CALVIN. I was wondering: do you provide official checkpoints for Grounding DINO that have been fine-tuned or specifically configured for the LIBERO, CALVIN, or BridgeData environments?
Having access to these weights or a more detailed processing script would be extremely helpful for reproducing your results.
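For context, here is roughly how I am currently turning a stock Grounding DINO detection into a gaze-region crop. The padding margin and the function itself are my own guesses, not taken from your repo, so I would love to know if your preprocessing differs:

```python
import numpy as np

def crop_gaze_region(image, box, margin=0.1):
    """Crop a padded gaze region from an HxWxC image array.

    `box` is (x1, y1, x2, y2) in pixel coordinates, e.g. from a
    Grounding DINO detection. The `margin` (fraction of box size
    added on each side) is a guess on my part, not a value from
    the ReconVLA pipeline.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    # Expand the box by the margin, clamped to the image bounds.
    x1 = max(0, int(round(x1 - pad_x)))
    y1 = max(0, int(round(y1 - pad_y)))
    x2 = min(w, int(round(x2 + pad_x)))
    y2 = min(h, int(round(y2 + pad_y)))
    return image[y1:y2, x1:x2]
```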
Thank you for your time and for this great contribution to the field!
Best regards.