Conducting Simulator Using Deep Learning

A real-time gesture-controlled orchestra system. Wave your arms in front of a webcam and control the tempo and volume of a live orchestral rendering of Beethoven's 5th Symphony using deep learning.

95.9% test accuracy on 5 gesture classes, running in real time on CPU.

Why This Project Is Hard

Conducting is an inherently temporal art form. A downbeat isn't a pose, it's a motion through space over time. This makes it fundamentally harder than standard computer vision tasks like hand detection or face recognition, which classify a single static frame. A conducting gesture only makes sense as a sequence: the arc of the arm, the acceleration of the wrist, the shape the hand arrives in at the end.

This project tackles gesture recognition as a sequence classification problem — the model never sees a single frame in isolation. Instead it processes 30 consecutive frames of body and hand landmarks, learning the motion trajectories that define each gesture over time.

The deeper challenge was data. No large-scale conducting gesture dataset exists. With only 663 labelled training samples across 5 gesture classes, training a deep model from scratch is a losing battle: there simply isn't enough data to learn meaningful motion representations. I solved this by treating it as a transfer learning problem: pretrain on a larger related domain (American Sign Language, which shares body and hand motion structure with conducting), then fine-tune on the small target dataset. The result is a 6.1 percentage point improvement in test accuracy over training the same architecture from scratch (95.9% vs. 89.8%). Pretraining taught the model what human motion looks like, and that knowledge carried over successfully to a new, adjacent task.

Building this required working through the full deep learning engineering stack: dataset discovery and cleaning (WLASL had ~50% broken links), MediaPipe keypoint extraction pipelines, architecture design across multiple model iterations (MLP → single-stream LSTM → multi-stream BiLSTM), hyperparameter tuning, ablation studies, and finally integrating the trained model into a real-time audio system. Each failure taught something specific about why the next approach worked.

Demo

Five gestures control the orchestra in real time:

Gesture	Effect
Left swipe	Decrease tempo
Right swipe	Increase tempo
Stop	Silence orchestra
Thumbs down	Decrease volume
Thumbs up	Increase volume
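
As a rough illustration of the table above, the gesture-to-effect mapping can be sketched as a small state update. This is not the code in demo/: the label strings, step sizes, limits, and starting values are all assumptions chosen for the example, and whether a later gesture resumes playback after Stop is left open.

```python
from dataclasses import dataclass

# Hypothetical step sizes; the real values used by the demo are not stated here.
TEMPO_STEP_BPM = 10.0
VOLUME_STEP = 0.1


@dataclass
class OrchestraState:
    tempo_bpm: float = 108.0   # assumed starting tempo
    volume: float = 0.7        # assumed starting volume, from 0.0 to 1.0
    playing: bool = True


def apply_gesture(state: OrchestraState, gesture: str) -> OrchestraState:
    """Return a new state after applying one recognised gesture."""
    tempo, volume, playing = state.tempo_bpm, state.volume, state.playing
    if gesture == "left_swipe":
        tempo = max(40.0, tempo - TEMPO_STEP_BPM)      # slower, with a floor
    elif gesture == "right_swipe":
        tempo = min(240.0, tempo + TEMPO_STEP_BPM)     # faster, with a ceiling
    elif gesture == "stop":
        playing = False                                # silence the orchestra
    elif gesture == "thumbs_down":
        volume = max(0.0, round(volume - VOLUME_STEP, 2))
    elif gesture == "thumbs_up":
        volume = min(1.0, round(volume + VOLUME_STEP, 2))
    else:
        raise ValueError(f"unknown gesture: {gesture!r}")
    return OrchestraState(tempo, volume, playing)
```

Returning a new state instead of mutating the old one keeps each gesture's effect easy to test in isolation.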

How It Works

1. Webcam captures live frames at 30 fps
2. MediaPipe Holistic extracts 75 body and hand landmarks per frame (33 pose + 21 left hand + 21 right hand)
3. A 30-frame rolling buffer feeds into a multi-stream BiLSTM model every 10 frames
4. The predicted gesture triggers a real-time audio change via FluidSynth
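
The rolling buffer in the steps above can be sketched with a fixed-length deque. This is a minimal sketch, not the demo's actual code: the class name `RollingWindow` and its `push` method are hypothetical, and only the window length (30) and stride (10) come from the description above.

```python
from collections import deque

WINDOW = 30   # frames per model input
STRIDE = 10   # run the model once every 10 new frames


class RollingWindow:
    """Keeps the most recent frames and signals when inference is due."""

    def __init__(self, window: int = WINDOW, stride: int = STRIDE):
        self.frames = deque(maxlen=window)   # oldest frame drops off automatically
        self.stride = stride
        self.since_last = 0                  # frames seen since the last inference

    def push(self, frame):
        """Add one frame; return a full window every `stride` frames, else None."""
        self.frames.append(frame)
        self.since_last += 1
        if len(self.frames) == self.frames.maxlen and self.since_last >= self.stride:
            self.since_last = 0
            return list(self.frames)
        return None
```

With these settings the first prediction arrives once 30 frames have been seen (about one second at 30 fps), and consecutive windows overlap by 20 frames.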

Model Architecture

A multi-stream bidirectional LSTM processes each body region through its own parallel stream:

Pose stream — 33 landmarks, hidden=128, output=256
Left hand stream — 21 landmarks, hidden=64, output=128
Right hand stream — 21 landmarks, hidden=64, output=128

Each stream reads the 30-frame sequence both forward and backward (bidirectional), learning motion context in both temporal directions. Outputs are attention-pooled across frames — the model learns which moments in the gesture matter most — then concatenated to 512-dim, compressed through a fusion MLP to 128-dim, and classified.

The three-stream design reflects the structure of the problem: arm trajectory and hand shape carry different information. A swipe is defined by where the arm goes; a thumbs up is defined by what the hand does. Forcing both through one network loses that distinction. Separate streams let each specialise.
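
To make the dimensions concrete, here is a hedged PyTorch sketch of the three-stream layout described above. It is not a copy of scripts/multistream/model_multistream.py: the class names, the single-linear-layer attention scoring, the ReLU in the fusion layer, the assumption of 3 coordinates per landmark, and the pose / left hand / right hand ordering of the 75 landmarks are all illustrative choices.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Learns one score per frame and returns the score-weighted average."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, frames, dim)
        weights = torch.softmax(self.score(x), dim=1)      # (batch, frames, 1)
        return (weights * x).sum(dim=1)                    # (batch, dim)


class Stream(nn.Module):
    """One bidirectional LSTM followed by attention pooling over frames."""

    def __init__(self, n_landmarks: int, hidden: int, coords: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_landmarks * coords, hidden,
                            batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)   # forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                   # (batch, frames, 2 * hidden)
        return self.pool(out)


class MultiStreamBiLSTM(nn.Module):
    def __init__(self, n_classes: int = 5, coords: int = 3):
        super().__init__()
        self.pose = Stream(33, 128, coords)     # output 256
        self.left = Stream(21, 64, coords)      # output 128
        self.right = Stream(21, 64, coords)     # output 128
        self.fusion = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (batch, frames, 75, coords), assumed order pose, left, right
        b, t = landmarks.shape[:2]
        pose = landmarks[:, :, :33].reshape(b, t, -1)
        left = landmarks[:, :, 33:54].reshape(b, t, -1)
        right = landmarks[:, :, 54:].reshape(b, t, -1)
        fused = torch.cat(
            [self.pose(pose), self.left(left), self.right(right)], dim=-1
        )                                        # (batch, 512)
        return self.classifier(self.fusion(fused))
```

A batch of two 30-frame clips, `torch.randn(2, 30, 75, 3)`, passes through this sketch and comes out as a `(2, 5)` tensor of class logits.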

Transfer Learning

The model is pretrained on the ASL Citizen dataset (100 ASL sign classes, 3,544 training samples), reaching 99.8% validation accuracy, then fine-tuned on a 5-class gesture dataset (663 training samples), reaching 95.9% test accuracy. That is 6.1 percentage points higher than training the same architecture from scratch (89.8%).
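
One common way to move from a 100-class pretrained model to a 5-class one is to copy every weight except the final classifier and let the new head start from random initialisation. The sketch below shows that pattern on a small stand-in network. `TinyNet`, `init_from_pretrained`, and the `classifier.` key prefix are hypothetical names for illustration, and the fine-tuning script in this repo may handle the head differently.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Stand-in for the real model: a shared backbone plus a classifier head."""

    def __init__(self, n_classes: int):
        super().__init__()
        self.backbone = nn.LSTM(225, 16, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(32, n_classes)


def init_from_pretrained(pretrained_state: dict, n_classes: int) -> TinyNet:
    """Copy every weight except the classifier head into a fresh model."""
    model = TinyNet(n_classes)
    kept = {k: v for k, v in pretrained_state.items()
            if not k.startswith("classifier.")}
    # strict=False tolerates the head being absent from the copied weights.
    result = model.load_state_dict(kept, strict=False)
    # Sanity check: only the new head should be missing.
    assert set(result.missing_keys) == {"classifier.weight", "classifier.bias"}
    assert not result.unexpected_keys
    return model


asl_model = TinyNet(n_classes=100)    # pretend this was pretrained on ASL signs
gesture_model = init_from_pretrained(asl_model.state_dict(), n_classes=5)
```

Dropping the head keys before loading avoids the shape mismatch between the 100-class and 5-class output layers. Whether to freeze the copied layers during fine-tuning is a separate choice.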

Installation

Requirements

Python 3.9+
macOS with Homebrew
A webcam

1. Install system dependencies

brew install sdl2 sdl2_image sdl2_mixer sdl2_ttf fluid-synth

2. Clone the repo and create a virtual environment

git clone https://github.com/Andrew-Bonilla/conductor-simulator.git
cd conductor-simulator
python3 -m venv .venv
source .venv/bin/activate

3. Install Python dependencies

pip install torch torchvision mediapipe opencv-python numpy \
            pretty_midi pygame pyfluidsynth

4. Download the SoundFont

This is the only file too large for GitHub (141MB). Download it and place it in demo/:

curl -L -o demo/orchestra.sf2 "https://github.com/urish/cinto/raw/master/media/FluidR3%20GM.sf2"

Everything else — model weights, MediaPipe task files, and the MIDI file — is already included in the repo inside demo/.

Running the Demo

cd demo
python3 conductor_simulator.py

Make sure your upper body and both hands are visible in the frame. Hold each gesture for about one second. Press Q to quit.

Reproducing Training

All training scripts are in scripts/. Run everything from the repo root.

1. Pretrain on ASL Citizen

Download the ASL Citizen dataset (pre-extracted MediaPipe keypoints) from Kaggle and place in data/asl_citizen/.

python scripts/multistream/train_multistream.py

Produces checkpoints/multistream/best_model.pt — 99.8% validation accuracy.

2. Extract gesture keypoints

Download the gesture recognition dataset from Kaggle and place raw videos in data/gesture_finetune_raw/.

python scripts/extract_keypoints_gesture_holistic.py

Produces data/gesture_keypoints_holistic/{train,val,test}/.

3. Fine-tune on gesture data

python scripts/multistream/finetune_gesture_holistic.py

Produces checkpoints/multistream/finetuned_gesture_holistic.pt — 95.9% test accuracy.

4. Run ablation (optional)

python scripts/multistream/train_gesture_scratch.py

Trains from scratch without ASL pretraining and reaches 89.8% test accuracy, confirming the 6.1 percentage point gain from transfer learning.

Results

Model	Approach	Test Accuracy
Multi-stream BiLSTM	ASL pretrain + fine-tune (Holistic)	95.9%
Multi-stream BiLSTM	From scratch, no pretraining	89.8%
Multi-stream BiLSTM	Frozen head only	74.0% (val)
Multi-stream BiLSTM	ASL pretraining only	99.8% (val, 100 classes)

Project Structure

conductor-simulator/
├── demo/
│   ├── conductor_simulator.py         # main real-time demo
│   ├── audio_engine.py                # FluidSynth orchestra engine
│   ├── finetuned_gesture_holistic.pt  # trained model weights
│   ├── pose_landmarker.task           # MediaPipe pose model
│   ├── hand_landmarker.task           # MediaPipe hand model
│   └── orchestra.mid                  # Beethoven 5th Symphony (public domain)
│   # orchestra.sf2 goes here too — download separately (see Installation)
├── scripts/
│   ├── multistream/
│   │   ├── model_multistream.py           # BiLSTM architecture
│   │   ├── train_multistream.py           # ASL pretraining
│   │   ├── finetune_gesture_holistic.py   # gesture fine-tuning
│   │   ├── finetune_gesture_proper.py     # proper split retraining
│   │   └── train_gesture_scratch.py       # ablation baseline
│   └── extract_keypoints_gesture_holistic.py
└── docs/
    ├── ConductorSimulator_Writeup.pdf
    └── ConductorSimulator_Poster.pdf

Documentation

A full technical writeup and conference-style poster are available in the docs/ folder, covering the architecture, training methodology, ablation study, and results in detail.

Course

Columbia University — COMS BC3168 Deep Learning for Computer Graphics — Spring 2026

Andrew Bonilla
