This repository documents my implementation of a real-time gesture-based audio control system, inspired by [A Real-Time Gesture-Based Control Framework]. My goal is to build an interactive application that allows users to manipulate sound and music in real time using body movements, leveraging computer vision and machine learning.
- Project Motivation
- Reference Paper Summary
- System Overview
- Architecture
- Setup and Installation
- Usage Guide
- Experiments & Evaluation
- Key Learnings & Future Work
- References
Traditional audio control interfaces (knobs, sliders, keyboards) can be limiting for performers and interactive installations. Inspired by recent advances in gesture recognition and real-time sound control, I set out to create a system where users can intuitively manipulate music and sound using gestures, making audio interaction more expressive and accessible.
The reference paper presents a real-time, human-in-the-loop gesture control framework that adapts audio and music based on human movement via live video input. The system uses computer vision (MediaPipe) for landmark extraction, Max/MSP for multimedia processing, and Python-based machine learning for gesture classification. It enables users to train custom gestures (with as few as 50–80 samples), map them to audio controls, and perform real-time manipulation of sound features such as tempo, pitch, and effects.
Key Features:
- Real-time gesture recognition using MediaPipe
- User-friendly gesture training and mapping workflow
- Low-latency audio control via Pyo (Python audio synthesis library)
- Modular design: Python for ML and audio, OSC for communication
- Supports two performance scenarios: cue-based triggering and continuous control
| Component | Role |
|---|---|
| Pyo | Real-time audio synthesis and manipulation in Python |
| MediaPipe | Real-time body/hand/face landmark detection |
| Python | Trains and runs gesture classification models (MLP) |
| OSC | Real-time bridge between gesture recognition and audio |
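Because OSC is the glue between the vision/ML half and the audio half, here is a minimal sketch of that bridge with python-osc. The `/gesture` address, port 9000, and the gesture name are my own illustrative choices, and the two halves are meant to run as separate processes.

```python
# Gesture-recognition process: publish each prediction as an OSC message.
# The "/gesture" address and port 9000 are illustrative, not fixed by the project.
from pythonosc.udp_client import SimpleUDPClient

osc = SimpleUDPClient("127.0.0.1", 9000)
osc.send_message("/gesture", "raise_right_hand")

# Audio process: subscribe to the same address and react to incoming labels.
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

dispatcher = Dispatcher()
dispatcher.map("/gesture", lambda address, label: print("got", label))
BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher).serve_forever()
```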
- Video Capture: Python/OpenCV captures webcam video.
- Landmark Extraction: MediaPipe extracts body/hand landmarks.
- Gesture Classification: Python classifies gestures (training or inference).
- Audio Control: The recognized gesture directly manipulates Pyo audio parameters in real time (see the pipeline sketch below).
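A minimal per-frame sketch of this pipeline, assuming MediaPipe Hands for landmarks, a scikit-learn model and scaler saved under hypothetical file names (see the training sketch further down), and the OSC bridge above:

```python
import cv2
import joblib
import mediapipe as mp
import numpy as np
from pythonosc.udp_client import SimpleUDPClient

# Hypothetical artifact names; use whatever the training step saved.
model = joblib.load("gesture_mlp.joblib")
scaler = joblib.load("gesture_scaler.joblib")
osc = SimpleUDPClient("127.0.0.1", 9000)

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV delivers BGR frames.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark
        features = np.array([[p.x, p.y, p.z] for p in landmarks]).reshape(1, -1)
        label = model.predict(scaler.transform(features))[0]
        osc.send_message("/gesture", str(label))
    cv2.imshow("gesture control", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```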
- Hardware: Computer with webcam and speakers/headphones
- Software:
  - Python 3.x
  - Python packages: `mediapipe`, `numpy`, `scikit-learn`, `python-osc`, `pyo`, `opencv-python`
  - (Optional) Jupyter Notebook for experiments
- Clone this repository.
- Install Python dependencies: `pip install mediapipe numpy scikit-learn python-osc pyo opencv-python`
- Configure OSC communication if needed.
- Start the Python gesture training script to capture video and extract landmarks.
- Collect gesture samples:
  - Select a gesture to train (e.g., "raise right hand").
  - The system guides you through sample collection with metronome/visual cues.
  - Perform the gesture at each cue; neutral movements fill the "other" class (see the collection sketch below).
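Here is a rough sketch of cued sample collection, with a fixed timer standing in for the metronome; the label, sample count, and CSV file name are illustrative choices of mine, not the project's exact script.

```python
import csv
import time

import cv2
import mediapipe as mp

LABEL = "raise_right_hand"   # rerun with LABEL = "other" while moving neutrally
CUE_INTERVAL = 2.0           # seconds between cues (stand-in for the metronome)
SAMPLES_NEEDED = 60          # within the 50-80 range reported in the experiments

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)
collected = 0
next_cue = time.time() + CUE_INTERVAL

with open(f"samples_{LABEL}.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while collected < SAMPLES_NEEDED and cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if time.time() >= next_cue:
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                lm = results.multi_hand_landmarks[0].landmark
                writer.writerow([c for p in lm for c in (p.x, p.y, p.z)] + [LABEL])
                collected += 1
            next_cue = time.time() + CUE_INTERVAL

cap.release()
```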
- Model Training:
  - A Python script trains an MLP classifier on your samples.
  - Review the metrics; if accuracy is low, add more samples or cover edge cases.
  - Save the trained model and scaler for reuse (see the training sketch below).
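A minimal training sketch with scikit-learn, assuming the collected CSVs have been concatenated into one file (hypothetical name) with 63 landmark values plus a label per row; the MLP size is an illustrative choice, not the paper's exact configuration.

```python
import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumes the per-gesture CSVs were concatenated into one file:
# 63 landmark values per row followed by the gesture label.
data = np.loadtxt("samples_all.csv", delimiter=",", dtype=str)
X, y = data[:, :-1].astype(float), data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))

# Persist the model and the scaler together so inference reuses the same preprocessing.
joblib.dump(clf, "gesture_mlp.joblib")
joblib.dump(scaler, "gesture_scaler.joblib")
```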
- In Python, assign each trained gesture to an audio control (e.g., volume up, trigger sound, change effect).
- Choose scenario:
  - Performance cue (e.g., trigger a song section)
  - Continuous control (e.g., adjust gain in real time)
- Start the Python application.
- Perform gestures in front of the webcam.
- The system recognizes gestures and triggers the mapped audio actions in real time (latency < 200 ms) using Pyo; a sketch of the audio-control side follows below.
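On the audio side, here is a sketch of how incoming gesture labels could drive Pyo, covering both a cue-style trigger and a continuous gain change; the gesture names and mappings are illustrative, not the project's fixed vocabulary.

```python
from pyo import Server, SigTo, Sine
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Boot the Pyo audio server; audio runs in its own callback thread.
s = Server().boot()
s.start()

# Continuous-control target: a smoothed gain so volume changes don't click.
gain = SigTo(value=0.2, time=0.1)
synth = Sine(freq=220, mul=gain).out()

def on_gesture(address, label):
    # Illustrative mapping; use whatever gestures were actually trained.
    if label == "raise_right_hand":      # cue-style trigger: jump to a new pitch
        synth.freq = 440
    elif label == "raise_left_hand":     # continuous-style control: raise the gain
        gain.value = min(gain.value + 0.1, 1.0)
    # The neutral "other" class is deliberately ignored.

dispatcher = Dispatcher()
dispatcher.map("/gesture", on_gesture)
BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher).serve_forever()
```

Because Pyo's server processes audio in its own callback, the blocking OSC server can safely occupy the main thread.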
- Tested with multiple users and gestures (hand/leg raises, etc.).
- Sample counts: 50–80 per gesture for stable accuracy.
- Model: MLP classifier; achieved 90–95% accuracy for basic gestures.
- Latency: End-to-end system response under 0.2 seconds.
- Scenarios tested:
- Dance performance cueing (timed gesture triggers)
- Real-time sound control (volume/pitch/effect adjustment)
- Comparative evaluation: Similar accuracy to MediaPipe’s built-in hand gesture recognition.
Learnings:
- Modular architecture (Pyo + Python + OSC) enables flexible experimentation.
- User-specific training is crucial for high accuracy; generalized models need diverse data.
- Real-time feedback and low latency are achievable with careful engineering.
Future Work:
- Enhance robustness to lighting/background changes.
- Expand gesture vocabulary and support for more complex gestures.
- Explore integration with other creative coding environments (e.g., Unity, TouchDesigner).
- Investigate adaptive learning and user-independent models.
- [A Real-Time Gesture-Based Control Framework]
- [Real-time Hand Gesture Recognition - GitHub]
Acknowledgment:
This project is built as an independent implementation inspired by the architecture and methodology of the referenced paper.