This laboratory course introduces Vision-Language Models (VLMs) and their use for controlling a robotic manipulator.
Students will design a system where:
- A human provides a natural language instruction
- A Vision-Language Model interprets the scene
- The model generates a task plan
- The task plan is converted into robot actions (x, y, z coordinates)
- The plan is executed in:
  - Simulation (MuJoCo)
  - The real robot (ABB GoFa)
The objective is to bridge machine learning, perception, and robot control in a structured workflow.
By the end of this lab, you should be able to:
- Understand what a Vision-Language Model is and how it works.
- Use an API-based AI model from Hugging Face.
- Process visual input (RGB image) and textual input.
- Convert high-level instructions (e.g., “pick the red cube”) into robot actions.
- Execute a task pipeline in simulation.
- Deploy the same logic to a real ABB GoFa robot.
- Evaluate limitations and failure cases.
You can find the PDF with the presentation of the project here.
- Project workflow
- Colab notebook
- MuJoCo simulation environment
- API keys on Huggingface
- Models
- Resources
A Vision-Language Model (VLM) is a neural network that can process:
- Images
- Text
- (Sometimes video)
It learns a shared representation of vision and language, allowing it to describe images, answer questions about them (Visual Question Answering), follow visual instructions (visual prompting), generate action plans based on visual context, and perform many other tasks.
In this lab, the VLM will receive the following input and produce the following output:
Input:
- RGB image of the scene
- Text instruction (e.g., “Pick the blue block and place it on the green block”)
Output:
- A structured action plan (e.g., object → position → action sequence)
The action plan has to be converted into low-level control actions.
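For concreteness, a structured plan could look like the following. This is only a hypothetical schema (the field names `steps`, `action`, `object`, and `target` are illustrative); you will design your own format in the Reasoning phase.

```python
# Hypothetical example of a structured task plan produced by the VLM.
# The schema (field names and values) is illustrative, not prescribed.
example_plan = {
    "instruction": "Pick the blue block and place it on the green block",
    "steps": [
        {"action": "pick", "object": "blue block", "target": None},
        {"action": "place", "object": "blue block", "target": "green block"},
    ],
}
```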
The project is organized into the following phases:
- Background and Setup
  - Define the problem, tools, and software environment
  - Install and configure the required libraries and frameworks
  - Set up the MuJoCo simulation
  - Create a Hugging Face account and get an API key
  - Define the task (what the robot has to do)
- Perception: process sensor data (e.g. RGB images, robot states) to extract meaningful information about the environment.
  Input:
  - RGB image (from simulation or real camera)
  Tasks:
  - Capture an image
  - Preprocess the image with a segmentation or object-detection model (if necessary)
  - Format the input for the VLM (textual prompt + image)
  Output:
  - A function that returns structured scene information (a minimal sketch follows)
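A minimal perception sketch, assuming the `mujoco` Python bindings and the provided `scene.xml` (the file name, image size, and the `describe_scene` stub are assumptions you should adapt; offscreen rendering also needs a working OpenGL/EGL context, as set up in the Colab notebook):

```python
import mujoco
import numpy as np

def capture_rgb(model_path: str = "scene.xml", width: int = 640, height: int = 480) -> np.ndarray:
    """Render one offscreen RGB frame of the MuJoCo scene."""
    model = mujoco.MjModel.from_xml_path(model_path)
    data = mujoco.MjData(model)
    mujoco.mj_forward(model, data)            # compute poses without stepping time
    renderer = mujoco.Renderer(model, height=height, width=width)
    renderer.update_scene(data)               # uses the default free camera
    return renderer.render()                  # uint8 array of shape (height, width, 3)

def describe_scene(rgb: np.ndarray) -> dict:
    """Placeholder perception step returning structured scene information.

    Replace this stub with segmentation / object detection if your task needs it.
    """
    return {"image_shape": list(rgb.shape), "objects": []}
```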
- Reasoning: apply the VLM to interpret perceptual inputs and decide which actions the robot should take.
  Input:
  - Scene image
  - User instruction
  Tasks:
  - Query the VLM
  - Extract structured output
  Here you may use prompt engineering and force structured JSON outputs to guide the reasoning and constrain the response format (see the sketch below).
  Deliverable:
  - A structured task plan
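A minimal reasoning sketch, assuming the `huggingface_hub` `InferenceClient` chat-completion interface; the prompt, the JSON schema it asks for, and the model name (one of the models listed below) are examples, not requirements:

```python
import json
import os
from huggingface_hub import InferenceClient

def query_vlm(image_data_url: str, instruction: str,
              model: str = "Qwen/Qwen2.5-VL-72B-Instruct") -> dict:
    """Ask a VLM for a JSON task plan given an image (as a data URL) and an instruction."""
    client = InferenceClient(token=os.environ["HUGGINGFACE_API_KEY"])
    prompt = (
        "You plan actions for a tabletop robot arm.\n"
        f"Instruction: {instruction}\n"
        "Reply ONLY with JSON of the form "
        '{"steps": [{"action": "pick|place", "object": "...", "target": "..."}]}'
    )
    response = client.chat_completion(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        }],
        max_tokens=512,
    )
    text = response.choices[0].message.content
    # Models sometimes wrap JSON in markdown fences; strip them before parsing.
    text = text.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
    return json.loads(text)
```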
- Action translation: translate decisions ("pick red cube") into robot actions, such as motion ("x y z") and manipulation ("open/close gripper"). Convert semantic commands into:
  - Target Cartesian coordinates
  - Gripper commands (open/close)
  - A motion sequence
  Deliverable:
  - An executable motion plan in MuJoCo (a minimal sketch follows)
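A minimal translation sketch, assuming a plan in the hypothetical schema shown earlier and a dictionary mapping object names to (x, y, z) positions in the robot's base frame (obtained from the simulator state or your perception step); the vertical offsets are illustrative values:

```python
def plan_to_motions(plan: dict, object_positions: dict,
                    hover: float = 0.10, grasp: float = 0.02) -> list:
    """Convert a semantic task plan into (x, y, z, gripper) waypoints.

    hover/grasp are illustrative offsets in metres; tune them for your scene
    and the GoFa workspace, and add limit/collision checks before executing.
    """
    motions = []
    for step in plan["steps"]:
        if step["action"] == "pick":
            x, y, z = object_positions[step["object"]]
            motions += [
                (x, y, z + hover, "open"),       # approach from above
                (x, y, z + grasp, "open"),       # descend to the object
                (x, y, z + grasp, "close"),      # grasp
                (x, y, z + hover, "close"),      # lift
            ]
        elif step["action"] == "place":
            x, y, z = object_positions[step["target"]]
            motions += [
                (x, y, z + hover, "close"),      # move above the target
                (x, y, z + 2 * grasp, "close"),  # lower onto the target
                (x, y, z + 2 * grasp, "open"),   # release
                (x, y, z + hover, "open"),       # retreat
            ]
    return motions
```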
- Experiments: test the action plan on the simulated robot. Try it on the real robot only when you are satisfied with your simulations.
  - Measure the success rate
  - Analyze errors
  - Identify failure modes (a small bookkeeping sketch follows)
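A small bookkeeping sketch for the experiments; the trial log and failure labels are made-up examples, so record whatever categories you actually observe:

```python
from collections import Counter

# Hypothetical trial log; outcome is either "success" or a short failure label.
trials = [
    {"instruction": "pick the red cube", "outcome": "success"},
    {"instruction": "pick the red cube", "outcome": "wrong object selected"},
    {"instruction": "stack blue on green", "outcome": "success"},
    {"instruction": "stack blue on green", "outcome": "VLM returned invalid JSON"},
]

successes = sum(t["outcome"] == "success" for t in trials)
print(f"Success rate: {successes / len(trials):.0%}")
print("Failure modes:", Counter(t["outcome"] for t in trials if t["outcome"] != "success"))
```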
- Real Robot Deployment on ABB GoFa: transfer the validated pipeline to the physical robot. SAFETY AND VALIDATION ARE MANDATORY BEFORE EXECUTION!
- Evaluation and Analysis
  - Test the system, analyze performance, and discuss limitations and possible improvements
  - Use different models and compare their performance
  - Try different requests and check the task plan and robot execution
The following materials are provided:
- A Colab notebook showing an example Python implementation for a MuJoCo simulation
- A list of models you can start using via Hugging Face Inference Providers
- Documentation and tutorials explaining key concepts and implementation details
Students are encouraged to extend or modify the provided materials as part of the project.
The Colab notebook can be accessed here:
It demonstrates:
- Loading MuJoCo
- Rendering the scene
- Capturing an RGB image
- Sending an image + prompt to a VLM
- Receiving a response
Use it as a reference for API formatting, prompt construction, and image encoding.
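For instance, the image-encoding step might look like this: a sketch assuming Pillow and a base64 PNG data URL, which is one common format for OpenAI-style chat endpoints (check the notebook and your provider's documentation for the exact format they expect):

```python
import base64
import io

import numpy as np
from PIL import Image

def rgb_to_data_url(rgb: np.ndarray) -> str:
    """Encode an RGB array of shape (H, W, 3), dtype uint8, as a base64 PNG data URL."""
    buffer = io.BytesIO()
    Image.fromarray(rgb).save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"
```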
Here you can find a MuJoCo simulation environment with an ABB GoFa robot for tabletop manipulation.

The model consists of an XML file and various .STL meshes of the robot parts.
If you want to modify the scene.xml file, refer to the MuJoCo XML Reference.
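As a quick way to explore the provided model, you can load it with the `mujoco` Python bindings and list its bodies; the file name `scene.xml` is assumed from the description above:

```python
import mujoco

# Load the provided scene and print every body name: useful for finding the
# objects (cubes, table, gripper parts) you will reference in later phases.
model = mujoco.MjModel.from_xml_path("scene.xml")
for body_id in range(model.nbody):
    print(body_id, mujoco.mj_id2name(model, mujoco.mjtObj.mjOBJ_BODY, body_id))
```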
This project uses models hosted on Hugging Face, a platform that provides access to many AI models through an API.
To use these models, you need a Hugging Face API key (also called an access token).
An API key is a private authentication token that allows your program to access hosted AI models. You can think of it as a password for your program that allows it to access Hugging Face models on your behalf.
- Do not share your API key publicly
- Create a Hugging Face account
  - Go to https://huggingface.co
  - Click Sign Up (top right)
  - Create an account using email, Google, or GitHub
- Go to your Access Tokens page
  - Once logged in, click on your profile picture (top right)
  - Select Settings
  - In the left sidebar, click Access Tokens
- Create a new token
  - Click New token
  - Choose a name (e.g. my-first-api-key)
  - Select Read as the role (this is sufficient for most projects)
  - Select other options according to your needs
  - Click Generate token
- Copy and store the token
  - Copy the generated token immediately
  - Save it in a secure place
You will not be able to see it again after leaving the page.
Most projects expect the API key to be stored as an environment variable, for example:
Linux/macOS:
export HUGGINGFACE_API_KEY="your_token_here"
Windows:
setx HUGGINGFACE_API_KEY "your_token_here"
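In Python, you can then read the key from the environment instead of hard-coding it in your notebook or scripts:

```python
import os

# Fail early with a clear message if the key was not exported.
api_key = os.environ.get("HUGGINGFACE_API_KEY")
if not api_key:
    raise RuntimeError("HUGGINGFACE_API_KEY is not set; export it before running.")
```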
Below is a non-exhaustive list of Hugging Face models that can be used with this project.
Each entry includes the model name, a link to its Hugging Face page, and the model size (number of parameters).
| Model Name | Hugging Face Page | Size (Parameters) |
|---|---|---|
| Molmo2-8B | https://huggingface.co/allenai/Molmo2-8B?inference_provider=publicai | 8B |
| Qwen2.5-VL | https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct?inference_provider=hyperbolic | 72B |
| SmolVLM | https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct | 2B |
| Kimi-K2.5 | https://huggingface.co/moonshotai/Kimi-K2.5?inference_provider=novita | 171B |
Explore more models from this page!
Choosing which model to use is not easy. Most of the time, you have to consider different aspects and make a trade-off decision. Here are some general indications:
Smaller models (2B–8B):
- Faster
- Cheaper
- Less reasoning capability
Larger models (70B+):
- Better reasoning
- Slower
- Higher cost
For robotic planning, structured output quality is more important than conversational ability. Find a model that generates a good, structured plan that can easily be converted into robot actions.
- Vision Language Models Explained
- Vision Language Models (Better, faster, stronger)
- Demonstration-Free Robotic Control via LLM Agents (article)
- Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models (article)
- Visual Prompting in Multimodal Large Language Models: A Survey (article)
- MuJoCo official documentation
- MuJoCo Python tutorial
- A Tutorial on Visual Servo Control
- Tutorial - Image-based Visual Servo
- Eye-in-hand visual servoing
- ABB GoFa user manual
- What is Retrieval Augmented Generation (RAG)?
Please keep in mind that:
- The VLM does NOT control the robot directly; it generates semantic reasoning.
- You are responsible for translating that reasoning into safe robot commands.
- ALWAYS validate motion plans in simulation before running on real hardware.