This laboratory course introduces Vision-Language Models (VLMs) and their use for controlling a robotic manipulator.
Students will design a system where:
- A human provides a natural language instruction
- A Vision-Language Model interprets the scene
- The model generates a task plan
- The task plan is converted into robot actions (x, y, z coordinates)
- The plan is executed in:
  - Simulation (MuJoCo)
  - The real robot (ABB GoFa)
The objective is to bridge machine learning, perception, and robot control in a structured workflow.
By the end of this lab, you should be able to:
- Understand what a Vision-Language Model is and how it works.
- Use an API-based AI model from Hugging Face.
- Process visual input (RGB image) and textual input.
- Convert high-level instructions (e.g., “pick the red cube”) into robot actions.
- Execute a task pipeline in simulation.
- Deploy the same logic to a real ABB GoFa robot.
- Evaluate limitations and failure cases.
You can find the PDF with the presentation of the project here.
- Project workflow
- Colab notebook
- MuJoCo simulation environment
- API keys on Huggingface
- Models
- Resources
A Vision-Language Model (VLM) is a neural network that can process:
- Images
- Text
- (Sometimes video)
It learns a shared representation of vision and language, allowing it to describe images, answer questions about them (Visual Question Answering), follow visual instructions (visual prompting), generate action plans based on visual context, and perform many other tasks.
In this lab, the VLM will receive the following input and produce the following output:
Input:
- RGB image of the scene
- Text instruction (e.g., “Pick the blue block and place it on the green block”)
Output:
- A structured action plan (e.g., object → position → action sequence)
The action plan has to be converted into low-level control actions.
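For concreteness, a structured plan could look like the following. This is only a hypothetical schema (the field names `steps`, `action`, `object`, and `target` are illustrative); you will design your own format in the Reasoning phase.

```python
# Hypothetical example of a structured task plan produced by the VLM.
# The schema (field names and values) is illustrative, not prescribed.
example_plan = {
    "instruction": "Pick the blue block and place it on the green block",
    "steps": [
        {"action": "pick", "object": "blue block", "target": None},
        {"action": "place", "object": "blue block", "target": "green block"},
    ],
}
```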
The project is organized into the following phases:
- Background and Setup
  - Define the problem, tools, and software environment
  - Install and configure the required libraries and frameworks
  - Set up the MuJoCo simulation
  - Create a Hugging Face account and get an API key
  - Define the task (what the robot has to do)
- Perception: process sensor data (e.g. RGB images, robot states) to extract meaningful information about the environment.
  Input:
  - RGB image (from simulation or real camera)
  Tasks:
  - Capture an image
  - Preprocess the image with a segmentation or object-detection model (if necessary)
  - Format the input for the VLM (textual prompt + image)
  Output:
  - A function that returns structured scene information (a minimal sketch follows)
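A minimal perception sketch, assuming the `mujoco` Python bindings and the provided `scene.xml` (the file name, image size, and the `describe_scene` stub are assumptions you should adapt; offscreen rendering also needs a working OpenGL/EGL context, as set up in the Colab notebook):

```python
import mujoco
import numpy as np

def capture_rgb(model_path: str = "scene.xml", width: int = 640, height: int = 480) -> np.ndarray:
    """Render one offscreen RGB frame of the MuJoCo scene."""
    model = mujoco.MjModel.from_xml_path(model_path)
    data = mujoco.MjData(model)
    mujoco.mj_forward(model, data)            # compute poses without stepping time
    renderer = mujoco.Renderer(model, height=height, width=width)
    renderer.update_scene(data)               # uses the default free camera
    return renderer.render()                  # uint8 array of shape (height, width, 3)

def describe_scene(rgb: np.ndarray) -> dict:
    """Placeholder perception step returning structured scene information.

    Replace this stub with segmentation / object detection if your task needs it.
    """
    return {"image_shape": list(rgb.shape), "objects": []}
```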
- Reasoning: apply the VLM to interpret perceptual inputs and decide which actions the robot should take.
  Input:
  - Scene image
  - User instruction
  Tasks:
  - Query the VLM
  - Extract structured output
  Here you may use prompt engineering and force structured JSON outputs to guide the reasoning and constrain the response format (see the sketch below).
  Deliverable:
  - A structured task plan
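A minimal reasoning sketch, assuming the `huggingface_hub` `InferenceClient` chat-completion interface; the prompt, the JSON schema it asks for, and the model name (one of the models listed below) are examples, not requirements:

```python
import json
import os
from huggingface_hub import InferenceClient

def query_vlm(image_data_url: str, instruction: str,
              model: str = "Qwen/Qwen2.5-VL-72B-Instruct") -> dict:
    """Ask a VLM for a JSON task plan given an image (as a data URL) and an instruction."""
    client = InferenceClient(token=os.environ["HUGGINGFACE_API_KEY"])
    prompt = (
        "You plan actions for a tabletop robot arm.\n"
        f"Instruction: {instruction}\n"
        "Reply ONLY with JSON of the form "
        '{"steps": [{"action": "pick|place", "object": "...", "target": "..."}]}'
    )
    response = client.chat_completion(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
        max_tokens=512,
    )
    text = response.choices[0].message.content
    # Models sometimes wrap JSON in markdown fences; strip them before parsing.
    text = text.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
    return json.loads(text)
```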
- Action translation: translate decisions ("pick red cube") into robot actions, such as motion ("x y z") and manipulation ("open/close gripper"). Convert semantic commands into:
  - Target Cartesian coordinates
  - Gripper commands (open/close)
  - A motion sequence
  Deliverable:
  - An executable motion plan in MuJoCo (a minimal sketch follows)
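A minimal translation sketch, assuming a plan in the hypothetical schema shown earlier and a dictionary mapping object names to (x, y, z) positions in the robot's base frame (obtained from the simulator state or your perception step); the vertical offsets are illustrative values:

```python
def plan_to_motions(plan: dict, object_positions: dict,
                    hover: float = 0.10, grasp: float = 0.02) -> list:
    """Convert a semantic task plan into (x, y, z, gripper) waypoints.

    hover/grasp are illustrative offsets in metres; tune them for your scene
    and the GoFa workspace, and add limit/collision checks before executing.
    """
    motions = []
    for step in plan["steps"]:
        if step["action"] == "pick":
            x, y, z = object_positions[step["object"]]
            motions += [
                (x, y, z + hover, "open"),       # approach from above
                (x, y, z + grasp, "open"),       # descend to the object
                (x, y, z + grasp, "close"),      # grasp
                (x, y, z + hover, "close"),      # lift
            ]
        elif step["action"] == "place":
            x, y, z = object_positions[step["target"]]
            motions += [
                (x, y, z + hover, "close"),      # move above the target
                (x, y, z + 2 * grasp, "close"),  # lower onto the target
                (x, y, z + 2 * grasp, "open"),   # release
                (x, y, z + hover, "open"),       # retreat
            ]
    return motions
```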
- Experiments: test the action plan on the simulated robot. Try it on the real robot only when you are satisfied with your simulations.
  - Measure the success rate
  - Analyze errors
  - Identify failure modes (a small bookkeeping sketch follows)
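A small bookkeeping sketch for the experiments; the trial log and failure labels are made-up examples, so record whatever categories you actually observe:

```python
from collections import Counter

# Hypothetical trial log; outcome is either "success" or a short failure label.
trials = [
    {"instruction": "pick the red cube", "outcome": "success"},
    {"instruction": "pick the red cube", "outcome": "wrong object selected"},
    {"instruction": "stack blue on green", "outcome": "success"},
    {"instruction": "stack blue on green", "outcome": "VLM returned invalid JSON"},
]

successes = sum(t["outcome"] == "success" for t in trials)
print(f"Success rate: {successes / len(trials):.0%}")
print("Failure modes:", Counter(t["outcome"] for t in trials if t["outcome"] != "success"))
```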
- Real Robot Deployment on ABB GoFa: transfer the validated pipeline to the physical robot. SAFETY AND VALIDATION ARE MANDATORY BEFORE EXECUTION!
- Evaluation and Analysis
  - Test the system, analyze performance, and discuss limitations and possible improvements
  - Use different models and compare their performance
  - Try different requests and check the task plan and robot execution
The following materials are provided:
- A Colab notebook showing an example Python implementation for a MuJoCo simulation
- A list of models you can start using via Hugging Face Inference Providers
- Documentation and tutorials explaining key concepts and implementation details
Students are encouraged to extend or modify the provided materials as part of the project.
The Colab notebook can be accessed here:
It demonstrates:
- Loading MuJoCo
- Rendering the scene
- Capturing an RGB image
- Sending an image + prompt to a VLM
- Receiving a response
Use it as a reference for API formatting, prompt construction, and image encoding.
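For instance, the image-encoding step might look like this: a sketch assuming Pillow and a base64 PNG data URL, which is one common format for OpenAI-style chat endpoints (check the notebook and your provider's documentation for the exact format they expect):

```python
import base64
import io

import numpy as np
from PIL import Image

def rgb_to_data_url(rgb: np.ndarray) -> str:
    """Encode an RGB array of shape (H, W, 3), dtype uint8, as a base64 PNG data URL."""
    buffer = io.BytesIO()
    Image.fromarray(rgb).save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"
```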
Here you can find a MuJoCo simulation environment with an ABB GoFa robot for tabletop manipulation.

The model consists of an XML file and various .STL meshes of the robot parts.
If you want to modify the scene.xml file, refer to the MuJoCo XML Reference.
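As a quick way to explore the provided model, you can load it with the `mujoco` Python bindings and list its bodies; the file name `scene.xml` is assumed from the description above:

```python
import mujoco

# Load the provided scene and print every body name: useful for finding the
# objects (cubes, table, gripper parts) you will reference in later phases.
model = mujoco.MjModel.from_xml_path("scene.xml")
for body_id in range(model.nbody):
    print(body_id, mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_BODY, body_id))
```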
This project uses models hosted on Hugging Face, a platform that provides access to many AI models through an API.
To use these models, you need a Hugging Face API key (also called an access token).
An API key is a private authentication token that allows your program to access hosted AI models. You can think of it as a password for your program that allows it to access Hugging Face models on your behalf.
- Do not share your API key publicly
- Create a Hugging Face account
  - Go to https://huggingface.co
  - Click Sign Up (top right)
  - Create an account using email, Google, or GitHub
- Go to your Access Tokens page
  - Once logged in, click on your profile picture (top right)
  - Select Settings
  - In the left sidebar, click Access Tokens
- Create a new token
  - Click New token
  - Choose a name (e.g. my-first-api-key)
  - Select Read as the role (this is sufficient for most projects)
  - Select other options according to your needs
  - Click Generate token
- Copy and store the token
  - Copy the generated token immediately
  - Save it in a secure place
You will not be able to see it again after leaving the page.
Most projects expect the API key to be stored as an environment variable, for example:
Linux/macOS:
export HUGGINGFACE_API_KEY="your_token_here"
Windows:
setx HUGGINGFACE_API_KEY "your_token_here"
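In Python, you can then read the key from the environment instead of hard-coding it in your notebook or scripts:

```python
import os

# Fail early with a clear message if the key was not exported.
api_key = os.environ.get("HUGGINGFACE_API_KEY")
if not api_key:
    raise RuntimeError("HUGGINGFACE_API_KEY is not set; export it before running.")
```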
Below is a non-exhaustive list of Hugging Face models that can be used with this project.
Each entry includes the model name, a link to its Hugging Face page, and the model size (number of parameters).
| Model Name | Hugging Face Page | Size (Parameters) |
|---|---|---|
| Molmo2-8B | https://huggingface.co/allenai/Molmo2-8B?inference_provider=publicai | 8B |
| Qwen2.5-VL | https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct?inference_provider=hyperbolic | 72B |
| SmolVLM | https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct | 2B |
| Kimi-K2.5 | https://huggingface.co/moonshotai/Kimi-K2.5?inference_provider=novita | 171B |
Explore more models from this page!
Choosing which model to use is not easy. Most of the time, you have to consider different aspects and make a trade-off decision. Here are some general indications:
Smaller models (2B–8B):
- Faster
- Cheaper
- Less reasoning capability
Larger models (70B+):
- Better reasoning
- Slower
- Higher cost
For robotic planning, structured output quality is more important than conversational ability. Find a model that generates a good, structured plan that can easily be converted into robot actions.
- Vision Language Models Explained
- Vision Language Models (Better, faster, stronger)
- Demonstration-Free Robotic Control via LLM Agents (article)
- Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models (article)
- Visual Prompting in Multimodal Large Language Models: A Survey (article)
- MuJoCo official documentation
- MuJoCo Python tutorial
- A Tutorial on Visual Servo Control
- Tutorial - Image-based Visual Servo
- Eye-in-hand visual servoing
- ABB GoFa user manual
- What is Retrieval Augmented Generation (RAG)?
Please keep in mind that:
- The VLM does NOT control the robot directly; it generates semantic reasoning.
- You are responsible for translating that reasoning into safe robot commands.
- ALWAYS validate motion plans in simulation before running on real hardware.