This repository documents my implementation of a real-time gesture-based audio control system, inspired by [A Real-Time Gesture-Based Control Framework]. My goal is to build an interactive application that allows users to manipulate sound and music in real time using body movements, leveraging computer vision and machine learning.
- Project Motivation
- Reference Paper Summary
- System Overview
- Architecture
- Setup and Installation
- Usage Guide
- Experiments & Evaluation
- Key Learnings & Future Work
- References
Traditional audio control interfaces (knobs, sliders, keyboards) can be limiting for performers and interactive installations. Inspired by recent advances in gesture recognition and real-time sound control, I set out to create a system where users can intuitively manipulate music and sound using gestures, making audio interaction more expressive and accessible.
The reference paper presents a real-time, human-in-the-loop gesture control framework that adapts audio and music based on human movement via live video input. The system uses computer vision (MediaPipe) for landmark extraction, Max/MSP for multimedia processing, and Python-based machine learning for gesture classification. It enables users to train custom gestures (with as few as 50–80 samples), map them to audio controls, and perform real-time manipulation of sound features such as tempo, pitch, and effects.
Key Features:
- Real-time gesture recognition using MediaPipe
- User-friendly gesture training and mapping workflow
- Low-latency audio control via Pyo (Python audio synthesis library)
- Modular design: Python for ML and audio, OSC for communication
- Supports two performance scenarios: cue-based triggering and continuous control
| Component | Role |
|---|---|
| Pyo | Real-time audio synthesis and manipulation in Python |
| MediaPipe | Real-time body/hand/face landmark detection |
| Python | Trains and runs gesture classification models (MLP) |
| OSC | Real-time bridge between gesture recognition and audio |
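Because OSC is the glue between the vision/ML half and the audio half, here is a minimal sketch of that bridge with python-osc. The `/gesture` address, port 9000, and the gesture name are my own illustrative choices, and the two halves are meant to run as separate processes.

```python
# Gesture-recognition process: publish each prediction as an OSC message.
# The "/gesture" address and port 9000 are illustrative, not fixed by the project.
from pythonosc.udp_client import SimpleUDPClient

osc = SimpleUDPClient("127.0.0.1", 9000)
osc.send_message("/gesture", "raise_right_hand")

# Audio process: subscribe to the same address and react to incoming labels.
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

dispatcher = Dispatcher()
dispatcher.map("/gesture", lambda address, label: print("got", label))
BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher).serve_forever()
```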
- Video Capture: Python/OpenCV captures webcam video.
- Landmark Extraction: MediaPipe extracts body/hand landmarks.
- Gesture Classification: Python classifies gestures (training or inference).
- Audio Control: The recognized gesture directly manipulates Pyo audio parameters in real time (see the pipeline sketch below).
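A minimal per-frame sketch of this pipeline, assuming MediaPipe Hands for landmarks, a scikit-learn model and scaler saved under hypothetical file names (see the training sketch further down), and the OSC bridge above:

```python
import cv2
import joblib
import mediapipe as mp
import numpy as np
from pythonosc.udp_client import SimpleUDPClient

# Hypothetical artifact names; use whatever the training step saved.
model = joblib.load("gesture_mlp.joblib")
scaler = joblib.load("gesture_scaler.joblib")
osc = SimpleUDPClient("127.0.0.1", 9000)

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV delivers BGR frames.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark
        features = np.array([[p.x, p.y, p.z] for p in landmarks]).reshape(1, -1)
        label = model.predict(scaler.transform(features))[0]
        osc.send_message("/gesture", str(label))
    cv2.imshow("gesture control", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```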
- Hardware: Computer with webcam and speakers/headphones
- Software:
  - Python 3.x
  - Python packages: `mediapipe`, `numpy`, `scikit-learn`, `python-osc`, `pyo`, `opencv-python`
  - (Optional) Jupyter Notebook for experiments
- Clone this repository.
- Install Python dependencies: `pip install mediapipe numpy scikit-learn python-osc pyo opencv-python`
- Configure OSC communication if needed.
- Start the Python gesture training script to capture video and extract landmarks.
- Collect gesture samples:
  - Select a gesture to train (e.g., "raise right hand").
  - The system guides you through sample collection with metronome/visual cues.
  - Perform the gesture at each cue; neutral movements fill the "other" class (see the collection sketch below).
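Here is a rough sketch of cued sample collection, with a fixed timer standing in for the metronome; the label, sample count, and CSV file name are illustrative choices of mine, not the project's exact script.

```python
import csv
import time

import cv2
import mediapipe as mp

LABEL = "raise_right_hand"   # rerun with LABEL = "other" while moving neutrally
CUE_INTERVAL = 2.0           # seconds between cues (stand-in for the metronome)
SAMPLES_NEEDED = 60          # within the 50-80 range reported in the experiments

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)
collected = 0
next_cue = time.time() + CUE_INTERVAL

with open(f"samples_{LABEL}.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while collected < SAMPLES_NEEDED and cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if time.time() >= next_cue:
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                lm = results.multi_hand_landmarks[0].landmark
                writer.writerow([c for p in lm for c in (p.x, p.y, p.z)] + [LABEL])
                collected += 1
            next_cue = time.time() + CUE_INTERVAL

cap.release()
```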
- Model Training:
  - A Python script trains an MLP classifier on your samples.
  - Review the metrics; if accuracy is low, add more samples or cover edge cases.
  - Save the trained model and scaler for reuse (see the training sketch below).
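A minimal training sketch with scikit-learn, assuming the collected CSVs have been concatenated into one file (hypothetical name) with 63 landmark values plus a label per row; the MLP size is an illustrative choice, not the paper's exact configuration.

```python
import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumes the per-gesture CSVs were concatenated into one file:
# 63 landmark values per row followed by the gesture label.
data = np.loadtxt("samples_all.csv", delimiter=",", dtype=str)
X, y = data[:, :-1].astype(float), data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))

# Persist the model and the scaler together so inference reuses the same preprocessing.
joblib.dump(clf, "gesture_mlp.joblib")
joblib.dump(scaler, "gesture_scaler.joblib")
```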
- In Python, assign each trained gesture to an audio control (e.g., volume up, trigger sound, change effect).
- Choose scenario:
  - Performance cue (e.g., trigger a song section)
  - Continuous control (e.g., adjust gain in real time)
- Start the Python application.
- Perform gestures in front of the webcam.
- The system recognizes gestures and triggers the mapped audio actions in real time (latency < 200 ms) using Pyo; a sketch of the audio-control side follows below.
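On the audio side, here is a sketch of how incoming gesture labels could drive Pyo, covering both a cue-style trigger and a continuous gain change; the gesture names and mappings are illustrative, not the project's fixed vocabulary.

```python
from pyo import Server, SigTo, Sine
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Boot the Pyo audio server; audio runs in its own callback thread.
s = Server().boot()
s.start()

# Continuous-control target: a smoothed gain so volume changes don't click.
gain = SigTo(value=0.2, time=0.1)
synth = Sine(freq=220, mul=gain).out()

def on_gesture(address, label):
    # Illustrative mapping; use whatever gestures were actually trained.
    if label == "raise_right_hand":      # cue-style trigger: jump to a new pitch
        synth.freq = 440
    elif label == "raise_left_hand":     # continuous-style control: raise the gain
        gain.value = min(gain.value + 0.1, 1.0)
    # The neutral "other" class is deliberately ignored.

dispatcher = Dispatcher()
dispatcher.map("/gesture", on_gesture)
BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher).serve_forever()
```

Because Pyo's server processes audio in its own callback, the blocking OSC server can safely occupy the main thread.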
- Tested with multiple users and gestures (hand/leg raises, etc.).
- Sample counts: 50–80 per gesture for stable accuracy.
- Model: MLP classifier; achieved 90–95% accuracy for basic gestures.
- Latency: End-to-end system response under 0.2 seconds.
- Scenarios tested:
- Dance performance cueing (timed gesture triggers)
- Real-time sound control (volume/pitch/effect adjustment)
- Comparative evaluation: Similar accuracy to MediaPipe’s built-in hand gesture recognition.
Learnings:
- Modular architecture (Pyo + Python + OSC) enables flexible experimentation.
- User-specific training is crucial for high accuracy; generalized models need diverse data.
- Real-time feedback and low latency are achievable with careful engineering.
Future Work:
- Enhance robustness to lighting/background changes.
- Expand gesture vocabulary and support for more complex gestures.
- Explore integration with other creative coding environments (e.g., Unity, TouchDesigner).
- Investigate adaptive learning and user-independent models.
- [A Real-Time Gesture-Based Control Framework]
- [Real-time Hand Gesture Recognition - GitHub]
Acknowledgment:
This project is built as an independent implementation inspired by the architecture and methodology of the referenced paper.