
Real-Time Gesture-Based Audio Control System

This repository documents my implementation of a real-time gesture-based audio control system, inspired by [A Real-Time Gesture-Based Control Framework]. My goal is to build an interactive application that allows users to manipulate sound and music in real time using body movements, leveraging computer vision and machine learning.


Table of Contents

  • Project Motivation
  • Reference Paper Summary
  • System Overview
  • Architecture
  • Setup and Installation
  • Usage Guide
  • Experiments & Evaluation
  • Key Learnings & Future Work
  • References

Project Motivation

Traditional audio control interfaces (knobs, sliders, keyboards) can be limiting for performers and interactive installations. Inspired by recent advances in gesture recognition and real-time sound control, I set out to create a system where users can intuitively manipulate music and sound using gestures, making audio interaction more expressive and accessible.


Reference Paper Summary

The reference paper presents a real-time, human-in-the-loop gesture control framework that adapts audio and music based on human movement via live video input. The system uses computer vision (MediaPipe) for landmark extraction, Max/MSP for multimedia processing, and Python-based machine learning for gesture classification. It enables users to train custom gestures (with as few as 50–80 samples), map them to audio controls, and perform real-time manipulation of sound features such as tempo, pitch, and effects.


System Overview

Key Features:

  • Real-time gesture recognition using MediaPipe
  • User-friendly gesture training and mapping workflow
  • Low-latency audio control via Pyo (Python audio synthesis library)
  • Modular design: Python for ML and audio, OSC for communication
  • Supports both cue-based and continuous-control performance scenarios

Architecture

Main Components

Component | Role
----------|--------------------------------------------------------
Pyo       | Real-time audio synthesis and manipulation in Python
MediaPipe | Real-time body/hand/face landmark detection
Python    | Trains and runs gesture classification models (MLP)
OSC       | Real-time bridge between gesture recognition and audio

Data Flow

  1. Video Capture: Python/OpenCV captures webcam video.
  2. Landmark Extraction: MediaPipe extracts body/hand landmarks.
  3. Gesture Classification: Python classifies gestures (training or inference).
  4. Audio Control: The recognized gesture directly manipulates Pyo audio parameters in real time (a minimal end-to-end sketch follows this list).
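
The following minimal sketch strings the four steps together under stated assumptions: it tracks a single hand, loads an MLP and feature scaler from the hypothetical files gesture_mlp.joblib and gesture_scaler.joblib (see the Training Phase below), and maps one hypothetical gesture label to a pitch change on a Pyo sine tone.

    # Minimal end-to-end sketch: webcam -> MediaPipe -> MLP -> Pyo.
    # File names and the "raise_hand" label are hypothetical.
    import cv2
    import joblib
    import mediapipe as mp
    import numpy as np
    from pyo import Server, Sine

    server = Server().boot().start()      # step 4: audio engine
    tone = Sine(freq=440, mul=0.2).out()  # a simple tone to manipulate

    clf = joblib.load("gesture_mlp.joblib")
    scaler = joblib.load("gesture_scaler.joblib")

    hands = mp.solutions.hands.Hands(max_num_hands=1)
    cap = cv2.VideoCapture(0)             # step 1: video capture

    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Step 2: landmark extraction (MediaPipe expects RGB input).
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            features = np.array([[p.x, p.y, p.z] for p in lm]).reshape(1, -1)
            # Step 3: gesture classification.
            label = clf.predict(scaler.transform(features))[0]
            # Step 4: audio control, mapping the label to a Pyo parameter.
            tone.freq = 660 if label == "raise_hand" else 440
        cv2.imshow("preview", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    server.stop()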

Setup and Installation

Prerequisites

  • Hardware: Computer with webcam and speakers/headphones
  • Software:
    • Python 3.x
    • Python packages: mediapipe, numpy, scikit-learn, python-osc, pyo, opencv-python
    • (Optional) Jupyter Notebook for experiments

Installation Steps

  1. Clone this repository.
  2. Install Python dependencies:
    pip install mediapipe numpy scikit-learn python-osc pyo opencv-python
    
  3. Configure OSC communication if the gesture-recognition and audio components run as separate processes (a minimal sketch follows).
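
If the two components do run separately, python-osc can bridge them. The sketch below is illustrative; the port 9000 and the /gesture address are assumptions, not values fixed by this project. On the sending (recognition) side:

    # Send a recognized gesture label over OSC (port/address assumed).
    from pythonosc.udp_client import SimpleUDPClient

    client = SimpleUDPClient("127.0.0.1", 9000)
    client.send_message("/gesture", "raise_hand")  # hypothetical label

And on the receiving (audio) side:

    # Dispatch incoming gesture labels to a handler.
    from pythonosc.dispatcher import Dispatcher
    from pythonosc.osc_server import BlockingOSCUDPServer

    def on_gesture(address, label):
        print(f"{address}: {label}")  # hook Pyo actions here

    dispatcher = Dispatcher()
    dispatcher.map("/gesture", on_gesture)
    BlockingOSCUDPServer(("127.0.0.1", 9000), dispatcher).serve_forever()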

Usage Guide

Training Phase

  1. Start the Python gesture training script to capture video and extract landmarks.
  2. Collect gesture samples:
    • Select a gesture to train (e.g., "raise right hand").
    • The system guides you through sample collection with metronome/visual cues.
    • Perform the gesture at each cue; neutral movements between cues fill the "other" class.
  3. Model Training:
    • A Python script trains an MLP classifier on your samples (a minimal sketch follows this list).
    • Review the metrics; if accuracy is low, add more samples or corner cases.
    • Save the trained model and scaler for reuse.
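
A minimal training sketch, assuming landmark features have already been collected into NumPy arrays; the file names, layer sizes, and train/test split are assumptions:

    # Train and persist an MLP gesture classifier (scikit-learn).
    import joblib
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    X = np.load("gesture_features.npy")  # hypothetical collected samples
    y = np.load("gesture_labels.npy")    # hypothetical labels, incl. "other"

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y
    )

    # Scale features, then fit a small MLP on the training split.
    scaler = StandardScaler().fit(X_train)
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(scaler.transform(X_train), y_train)
    print("test accuracy:", clf.score(scaler.transform(X_test), y_test))

    # Persist both model and scaler for the performance phase.
    joblib.dump(clf, "gesture_mlp.joblib")
    joblib.dump(scaler, "gesture_scaler.joblib")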

Mapping Phase

  • In Python, assign each trained gesture to an audio control (e.g., volume up, trigger sound, change effect); a sketch of one such mapping follows this list.
  • Choose scenario:
    • Performance cue (e.g., trigger song section)
    • Continuous control (e.g., adjust gain in real time)
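
One simple way to express such a mapping is a dictionary from gesture labels to callables acting on the Pyo objects; the labels and actions below are illustrative, not the project's actual mapping.

    # Hypothetical gesture-to-action mapping over a Pyo oscillator.
    GESTURE_ACTIONS = {
        "raise_right_hand": lambda t: setattr(t, "mul", min(t.mul + 0.05, 1.0)),  # volume up
        "raise_left_hand": lambda t: setattr(t, "mul", max(t.mul - 0.05, 0.0)),   # volume down
        "both_hands_up": lambda t: setattr(t, "freq", 880),                       # performance cue
    }

    def apply_gesture(label, tone):
        # Run the mapped action for a recognized gesture; ignore "other".
        action = GESTURE_ACTIONS.get(label)
        if action is not None:
            action(tone)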

Performance Phase

  • Start the Python application.
  • Perform gestures in front of the webcam.
  • The system recognizes gestures and triggers the mapped audio actions in real time (end-to-end latency under 200 ms) via Pyo.

Experiments & Evaluation

  • Tested with multiple users and gestures (hand/leg raises, etc.).
  • Sample counts: 50–80 per gesture for stable accuracy.
  • Model: MLP classifier; achieved 90–95% accuracy for basic gestures.
  • Latency: End-to-end system response under 0.2 seconds.
  • Scenarios tested:
    • Dance performance cueing (timed gesture triggers)
    • Real-time sound control (volume/pitch/effect adjustment)
  • Comparative evaluation: accuracy on par with MediaPipe’s built-in hand gesture recognizer.

Key Learnings & Future Work

Learnings:

  • Modular architecture (Pyo + Python + OSC) enables flexible experimentation.
  • User-specific training is crucial for high accuracy; generalized models need diverse data.
  • Real-time feedback and low latency are achievable with careful engineering.

Future Work:

  • Enhance robustness to lighting/background changes.
  • Expand gesture vocabulary and support for more complex gestures.
  • Explore integration with other creative coding environments (e.g., Unity, TouchDesigner).
  • Investigate adaptive learning and user-independent models.

References

  • [A Real-Time Gesture-Based Control Framework]
  • [Real-time Hand Gesture Recognition - GitHub]

Acknowledgment:
This project is built as an independent implementation inspired by the architecture and methodology of the referenced paper.

