Andrew-Bonilla/conductor-simulator

Conducting Simulator Using Deep Learning

A real-time gesture-controlled orchestra system. Wave your arms in front of a webcam and control the tempo and volume of a live orchestral rendering of Beethoven's 5th Symphony using deep learning.

95.9% test accuracy on 5 gesture classes, running in real time on CPU.


Carnegie Hall performance that inspired this project

Why This Project Is Hard

Conducting is an inherently temporal art form. A downbeat isn't a pose; it's a motion through space over time. This makes it fundamentally harder than standard computer vision tasks like hand detection or face recognition, which classify a single static frame. A conducting gesture only makes sense as a sequence: the arc of the arm, the acceleration of the wrist, the shape the hand arrives in at the end.

This project tackles gesture recognition as a sequence classification problem — the model never sees a single frame in isolation. Instead it processes 30 consecutive frames of body and hand landmarks, learning the motion trajectories that define each gesture over time.

The deeper challenge was data. No large-scale conducting gesture dataset exists. With only 663 labelled training samples across 5 gesture classes, training a deep model from scratch is a losing battle: there simply isn't enough data to learn meaningful motion representations. I solved this by treating it as a transfer learning problem: pretrain on a larger related domain (American Sign Language, which shares body and hand motion structure with conducting), then fine-tune on the small target dataset. The result is a 6.1 percentage point improvement in test accuracy (95.9% vs. 89.8%) over training the same architecture from scratch. Pretraining taught the model what human motion looks like, and that knowledge carried over successfully into a new, adjacent context.

Building this required working through the full deep learning engineering stack: dataset discovery and cleaning (WLASL had ~50% broken links), MediaPipe keypoint extraction pipelines, architecture design across multiple model iterations (MLP → single-stream LSTM → multi-stream BiLSTM), hyperparameter tuning, ablation studies, and finally integrating the trained model into a real-time audio system. Each failure taught something specific about why the next approach worked.


Demo

Five gestures control the orchestra in real time:

| Gesture | Effect |
|---|---|
| Left swipe | Decrease tempo |
| Right swipe | Increase tempo |
| Stop | Silence orchestra |
| Thumbs down | Decrease volume |
| Thumbs up | Increase volume |
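
The gesture-to-effect mapping above can be sketched as a small state update. This is an illustration only, not the code in audio_engine.py: the label strings, step sizes, limits, and starting values are all assumptions made for the example.

```python
from dataclasses import dataclass

# Illustrative assumptions: the repo's real step sizes and limits may differ.
TEMPO_STEP_BPM = 10.0
VOLUME_STEP = 0.1
MIN_TEMPO_BPM, MAX_TEMPO_BPM = 40.0, 240.0


@dataclass(frozen=True)
class OrchestraState:
    tempo_bpm: float = 108.0  # assumed starting tempo
    volume: float = 0.7       # 0.0 (silent) to 1.0 (full)
    playing: bool = True


def apply_gesture(state: OrchestraState, gesture: str) -> OrchestraState:
    """Return the new playback state after one recognised gesture."""
    tempo, volume, playing = state.tempo_bpm, state.volume, state.playing
    if gesture == "left_swipe":
        tempo = max(MIN_TEMPO_BPM, tempo - TEMPO_STEP_BPM)
    elif gesture == "right_swipe":
        tempo = min(MAX_TEMPO_BPM, tempo + TEMPO_STEP_BPM)
    elif gesture == "stop":
        playing = False
    elif gesture == "thumbs_down":
        volume = max(0.0, round(volume - VOLUME_STEP, 2))
    elif gesture == "thumbs_up":
        volume = min(1.0, round(volume + VOLUME_STEP, 2))
    else:
        raise ValueError(f"unknown gesture label: {gesture!r}")
    return OrchestraState(tempo, volume, playing)
```

Clamping keeps repeated gestures from pushing tempo or volume outside a playable range. Returning a new state rather than mutating the old one makes the mapping easy to test without audio hardware.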

How It Works

  1. Webcam captures live frames at 30 fps
  2. MediaPipe Holistic extracts 75 body and hand landmarks per frame (33 pose + 21 left hand + 21 right hand)
  3. A 30-frame rolling buffer feeds into a multi-stream BiLSTM model every 10 frames
  4. The predicted gesture triggers a real-time audio change via FluidSynth
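
Steps 1 to 3 amount to a sliding window over the landmark stream. The sketch below shows one way to schedule it with the standard library only. The class name and the `predict` callback are hypothetical stand-ins for the repo's model call, and each frame is assumed to be a flat list of landmark coordinates.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

WINDOW = 30  # frames per model input (from the steps above)
STRIDE = 10  # run inference once every 10 new frames (from the steps above)

Frame = Sequence[float]  # one frame's flattened landmark coordinates (assumed layout)


class RollingGestureBuffer:
    """Keeps the latest WINDOW frames and triggers a prediction every STRIDE frames."""

    def __init__(self, predict: Callable[[List[Frame]], str]) -> None:
        self._frames: Deque[Frame] = deque(maxlen=WINDOW)  # oldest frame drops automatically
        self._predict = predict
        self._since_last = 0

    def push(self, frame: Frame) -> Optional[str]:
        """Add one frame; return a gesture label when inference runs, else None."""
        self._frames.append(frame)
        self._since_last += 1
        if len(self._frames) < WINDOW or self._since_last < STRIDE:
            return None
        self._since_last = 0
        return self._predict(list(self._frames))
```

With these settings, consecutive windows overlap by 20 frames. At 30 fps the first prediction arrives after about one second and later ones follow roughly every third of a second.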

Model Architecture

A multi-stream bidirectional LSTM processes each body region through its own parallel stream:

  • Pose stream — 33 landmarks, hidden=128, output=256
  • Left hand stream — 21 landmarks, hidden=64, output=128
  • Right hand stream — 21 landmarks, hidden=64, output=128

Each stream reads the 30-frame sequence both forward and backward (bidirectional), learning motion context in both temporal directions. Outputs are attention-pooled across frames — the model learns which moments in the gesture matter most — then concatenated to 512-dim, compressed through a fusion MLP to 128-dim, and classified.
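
The stream sizes and the 512 to 128 fusion described above can be sketched in PyTorch as follows. This is a minimal sketch, not the contents of model_multistream.py. Assumed for illustration: three coordinates (x, y, z) per landmark, single-layer LSTMs, a linear attention scorer, a ReLU in the fusion MLP, and no dropout.

```python
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """BiLSTM over one body region, followed by attention pooling across frames."""

    def __init__(self, n_landmarks: int, hidden: int, coords: int = 3) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_landmarks * coords, hidden,
                            batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)  # one attention score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_landmarks * coords)
        h, _ = self.lstm(x)                            # (batch, frames, 2 * hidden)
        weights = torch.softmax(self.score(h), dim=1)  # normalised over frames
        return (weights * h).sum(dim=1)                # (batch, 2 * hidden)


class MultiStreamBiLSTM(nn.Module):
    def __init__(self, n_classes: int = 5, coords: int = 3) -> None:
        super().__init__()
        self.pose = StreamEncoder(33, 128, coords)   # output 256
        self.left = StreamEncoder(21, 64, coords)    # output 128
        self.right = StreamEncoder(21, 64, coords)   # output 128
        self.fusion = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, pose: torch.Tensor, left: torch.Tensor,
                right: torch.Tensor) -> torch.Tensor:
        fused = torch.cat(
            [self.pose(pose), self.left(left), self.right(right)], dim=-1
        )  # (batch, 512)
        return self.classifier(self.fusion(fused))  # (batch, n_classes) logits
```

A batch of two 30-frame clips would be passed as `model(torch.randn(2, 30, 99), torch.randn(2, 30, 63), torch.randn(2, 30, 63))`, which returns logits of shape (2, 5).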

The three-stream design reflects the structure of the problem: arm trajectory and hand shape carry different information. A swipe is defined by where the arm goes; a thumbs up is defined by what the hand does. Forcing both through one network loses that distinction. Separate streams let each specialise.

Transfer Learning

The model is pretrained on the ASL Citizen dataset (100 ASL sign classes, 3,544 training samples), reaching 99.8% validation accuracy. It is then fine-tuned on a 5-class gesture dataset (663 training samples), reaching 95.9% test accuracy: a 6.1 percentage point improvement over training from scratch (89.8%).
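
Moving from 100 ASL classes to 5 gesture classes means the classification head cannot be reused, while the earlier layers can. One common way to do this in PyTorch is to copy only the tensors whose names and shapes match. The sketch below uses a tiny stand-in model and hypothetical helper names; it is not the logic in finetune_gesture_holistic.py.

```python
import torch
import torch.nn as nn


def make_model(n_classes: int) -> nn.ModuleDict:
    # Tiny stand-in for the real network: shared "backbone" plus a class-specific head.
    return nn.ModuleDict({
        "backbone": nn.Linear(16, 8),
        "classifier": nn.Linear(8, n_classes),
    })


def transfer_matching_weights(pretrained: nn.Module, target: nn.Module) -> list:
    """Copy every pretrained tensor whose name and shape match the target model."""
    target_state = target.state_dict()
    compatible = {
        name: tensor
        for name, tensor in pretrained.state_dict().items()
        if name in target_state and tensor.shape == target_state[name].shape
    }
    # strict=False lets the mismatched head keep its fresh random initialisation.
    target.load_state_dict(compatible, strict=False)
    return sorted(compatible)


asl_model = make_model(n_classes=100)    # stands in for the ASL-pretrained model
gesture_model = make_model(n_classes=5)  # new model with a 5-way head
copied = transfer_matching_weights(asl_model, gesture_model)
print(copied)
```

Here only the backbone tensors are copied, because the classifier shapes differ between the 100-class and 5-class heads. After the copy, the whole network can be trained on the gesture data.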


Installation

Requirements

  • Python 3.9+
  • macOS with Homebrew
  • A webcam

1. Install system dependencies

brew install sdl2 sdl2_image sdl2_mixer sdl2_ttf fluid-synth

2. Clone the repo and create a virtual environment

git clone https://github.com/Andrew-Bonilla/conductor-simulator.git
cd conductor-simulator
python3 -m venv .venv
source .venv/bin/activate

3. Install Python dependencies

pip install torch torchvision mediapipe opencv-python numpy \
            pretty_midi pygame pyfluidsynth

4. Download the SoundFont

The SoundFont (141 MB) is the only file too large to include in the GitHub repo. Download it and place it in demo/:

curl -L -o demo/orchestra.sf2 "https://github.com/urish/cinto/raw/master/media/FluidR3%20GM.sf2"

Everything else — model weights, MediaPipe task files, and the MIDI file — is already included in the repo inside demo/.
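
Before launching the demo, it can help to confirm that every asset is in place. The check below is a hypothetical helper, not a script shipped with the repo. The file names come from the Project Structure listing, and the script assumes it is run from the repo root.

```python
from pathlib import Path
from typing import List

# File names taken from the Project Structure listing in this README.
REQUIRED_DEMO_FILES = [
    "orchestra.sf2",                  # downloaded separately (step 4)
    "orchestra.mid",
    "finetuned_gesture_holistic.pt",
    "pose_landmarker.task",
    "hand_landmarker.task",
]


def missing_demo_files(demo_dir: Path) -> List[str]:
    """Return the required asset names that are not present in demo_dir."""
    return [name for name in REQUIRED_DEMO_FILES if not (demo_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_demo_files(Path("demo"))
    if missing:
        print("Missing from demo/: " + ", ".join(missing))
    else:
        print("All demo assets found.")
```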


Running the Demo

cd demo
python3 conductor_simulator.py

Make sure your upper body and both hands are visible in the frame. Hold each gesture for about one second. Press Q to quit.


Reproducing Training

All training scripts are in scripts/. Run everything from the repo root.

1. Pretrain on ASL Citizen

Download the ASL Citizen dataset (pre-extracted MediaPipe keypoints) from Kaggle and place it in data/asl_citizen/.

python scripts/multistream/train_multistream.py

Produces checkpoints/multistream/best_model.pt — 99.8% validation accuracy.

2. Extract gesture keypoints

Download the gesture recognition dataset from Kaggle and place raw videos in data/gesture_finetune_raw/.

python scripts/extract_keypoints_gesture_holistic.py

Produces data/gesture_keypoints_holistic/{train,val,test}/.

3. Fine-tune on gesture data

python scripts/multistream/finetune_gesture_holistic.py

Produces checkpoints/multistream/finetuned_gesture_holistic.pt — 95.9% test accuracy.

4. Run ablation (optional)

python scripts/multistream/train_gesture_scratch.py

Trains from scratch without ASL pretraining and reaches 89.8% test accuracy, confirming the 6.1 percentage point gain from transfer learning.


Results

| Model | Approach | Accuracy |
|---|---|---|
| Multi-stream BiLSTM | ASL pretrain + fine-tune (Holistic) | 95.9% (test) |
| Multi-stream BiLSTM | From scratch, no pretraining | 89.8% (test) |
| Multi-stream BiLSTM | Frozen head only | 74.0% (val) |
| Multi-stream BiLSTM | ASL pretraining only | 99.8% (val, 100 classes) |

Project Structure

conductor-simulator/
├── demo/
│   ├── conductor_simulator.py         # main real-time demo
│   ├── audio_engine.py                # FluidSynth orchestra engine
│   ├── finetuned_gesture_holistic.pt  # trained model weights
│   ├── pose_landmarker.task           # MediaPipe pose model
│   ├── hand_landmarker.task           # MediaPipe hand model
│   └── orchestra.mid                  # Beethoven 5th Symphony (public domain)
│   # orchestra.sf2 goes here too — download separately (see Installation)
├── scripts/
│   ├── multistream/
│   │   ├── model_multistream.py           # BiLSTM architecture
│   │   ├── train_multistream.py           # ASL pretraining
│   │   ├── finetune_gesture_holistic.py   # gesture fine-tuning
│   │   ├── finetune_gesture_proper.py     # proper split retraining
│   │   └── train_gesture_scratch.py       # ablation baseline
│   └── extract_keypoints_gesture_holistic.py
└── docs/
    ├── ConductorSimulator_Writeup.pdf
    └── ConductorSimulator_Poster.pdf

Documentation

A full technical writeup and conference-style poster are available in the docs/ folder, covering the architecture, training methodology, ablation study, and results in detail.

Course

Columbia University — COMS BC3168 Deep Learning for Computer Graphics — Spring 2026

Andrew Bonilla
