A real-time gesture-controlled orchestra system. Wave your arms in front of a webcam and control the tempo and volume of a live orchestral rendering of Beethoven's 5th Symphony using deep learning.
95.9% test accuracy on 5 gesture classes, running in real time on CPU.
Conducting is an inherently temporal art form. A downbeat isn't a pose, it's a motion through space over time. This makes it fundamentally harder than standard computer vision tasks like hand detection or face recognition, which classify a single static frame. A conducting gesture only makes sense as a sequence: the arc of the arm, the acceleration of the wrist, the shape the hand arrives in at the end.
This project tackles gesture recognition as a sequence classification problem — the model never sees a single frame in isolation. Instead it processes 30 consecutive frames of body and hand landmarks, learning the motion trajectories that define each gesture over time.
The deeper challenge was data. No large-scale conducting gesture dataset exists. With only 663 labelled training samples across 5 gesture classes, training a deep model from scratch is a losing battle, since there simply isn't enough data to learn meaningful motion representations. I solved this by treating it as a transfer learning problem: pretrain on a large related domain (American Sign Language, which shares body and hand motion structure with conducting), then fine-tune on the small target dataset. The result is a 6.1 percentage-point improvement in test accuracy over training the same architecture from scratch (95.9% vs. 89.8%). Pretraining taught the model what human motion looks like, and that knowledge carried over successfully to a new, adjacent context.
Building this required working through the full deep learning engineering stack: dataset discovery and cleaning (WLASL had ~50% broken links), MediaPipe keypoint extraction pipelines, architecture design across multiple model iterations (MLP → single-stream LSTM → multi-stream BiLSTM), hyperparameter tuning, ablation studies, and finally integrating the trained model into a real-time audio system. Each failure taught something specific about why the next approach worked.
Five gestures control the orchestra in real time:
| Gesture | Effect |
|---|---|
| Left swipe | Decrease tempo |
| Right swipe | Increase tempo |
| Stop | Silence orchestra |
| Thumbs down | Decrease volume |
| Thumbs up | Increase volume |
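The table above can be read as a small state machine. The following is a minimal sketch of that mapping, not the project's actual code: the label strings, step sizes, tempo limits, and the `OrchestraState` and `apply_gesture` names are all hypothetical, and whether "Stop" pauses playback or only mutes it is an assumption.

```python
from dataclasses import dataclass

# Hypothetical step sizes and limits; the project's real values are not stated.
TEMPO_STEP_BPM = 10.0
VOLUME_STEP = 0.1
MIN_TEMPO_BPM, MAX_TEMPO_BPM = 40.0, 240.0


@dataclass(frozen=True)
class OrchestraState:
    tempo_bpm: float = 108.0  # assumed starting tempo
    volume: float = 0.8       # 0.0 (silent) to 1.0 (full)
    playing: bool = True


def apply_gesture(state: OrchestraState, gesture: str) -> OrchestraState:
    """Return the new state after one recognized gesture (labels are assumed)."""
    tempo, volume, playing = state.tempo_bpm, state.volume, state.playing
    if gesture == "left_swipe":
        tempo = max(MIN_TEMPO_BPM, tempo - TEMPO_STEP_BPM)
    elif gesture == "right_swipe":
        tempo = min(MAX_TEMPO_BPM, tempo + TEMPO_STEP_BPM)
    elif gesture == "stop":
        playing = False  # assumption: "silence" is modelled as a paused flag
    elif gesture == "thumbs_down":
        volume = max(0.0, round(volume - VOLUME_STEP, 2))
    elif gesture == "thumbs_up":
        volume = min(1.0, round(volume + VOLUME_STEP, 2))
    else:
        raise ValueError(f"unknown gesture: {gesture!r}")
    return OrchestraState(tempo, volume, playing)
```

Clamping keeps repeated gestures from pushing tempo or volume outside a playable range.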
- Webcam captures live frames at 30 fps
- MediaPipe Holistic extracts 75 body and hand landmarks per frame (33 pose + 21 left hand + 21 right hand)
- A 30-frame rolling buffer feeds into a multi-stream BiLSTM model every 10 frames
- The predicted gesture triggers a real-time audio change via FluidSynth
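The buffering step in the pipeline above could look roughly like this. It is a sketch under assumptions: the `RollingGestureBuffer` class and the `classify` callback are hypothetical stand-ins for the real model call, and a frame is treated as a flat sequence of landmark coordinates.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

WINDOW = 30  # frames per model input (from the text)
STRIDE = 10  # run the model once every 10 new frames (from the text)

Frame = Sequence[float]  # one frame of flattened landmark coordinates


class RollingGestureBuffer:
    """Keeps the latest WINDOW frames and classifies every STRIDE frames."""

    def __init__(self, classify: Callable[[List[Frame]], str]) -> None:
        self._frames: Deque[Frame] = deque(maxlen=WINDOW)  # oldest frame drops out
        self._classify = classify
        self._since_last = 0

    def push(self, frame: Frame) -> Optional[str]:
        """Add one frame; return a prediction when one is due, else None."""
        self._frames.append(frame)
        self._since_last += 1
        if len(self._frames) < WINDOW or self._since_last < STRIDE:
            return None
        self._since_last = 0
        return self._classify(list(self._frames))
```

With these settings the first prediction arrives once 30 frames have been collected, and consecutive windows then overlap by 20 frames, so a gesture that straddles a window boundary still appears whole in a later window.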
A multi-stream bidirectional LSTM processes each body region through its own parallel stream:
- Pose stream — 33 landmarks, hidden=128, output=256
- Left hand stream — 21 landmarks, hidden=64, output=128
- Right hand stream — 21 landmarks, hidden=64, output=128
Each stream reads the 30-frame sequence both forward and backward (bidirectional), learning motion context in both temporal directions. Outputs are attention-pooled across frames — the model learns which moments in the gesture matter most — then concatenated to 512-dim, compressed through a fusion MLP to 128-dim, and classified.
The three-stream design reflects the structure of the problem: arm trajectory and hand shape carry different information. A swipe is defined by where the arm goes; a thumbs up is defined by what the hand does. Forcing both through one network loses that distinction. Separate streams let each specialise.
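As an illustration of the architecture described above, here is a minimal PyTorch sketch. The stream sizes and the 512-to-128 fusion come from the text. Everything else is an assumption: three coordinates per landmark, single-layer LSTMs, a one-layer fusion MLP with ReLU, and the class and attribute names, which need not match model_multistream.py.

```python
import torch
import torch.nn as nn

COORDS = 3  # assumption: (x, y, z) per landmark


class AttentionPool(nn.Module):
    """Learns one weight per frame and returns the weighted sum over frames."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, frames, dim)
        weights = torch.softmax(self.score(x), dim=1)     # (batch, frames, 1)
        return (weights * x).sum(dim=1)                   # (batch, dim)


class Stream(nn.Module):
    """One body region: BiLSTM over the frame sequence, then attention pooling."""

    def __init__(self, n_landmarks: int, hidden: int) -> None:
        super().__init__()
        self.lstm = nn.LSTM(n_landmarks * COORDS, hidden,
                            batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)  # forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                  # (batch, frames, 2 * hidden)
        return self.pool(out)


class MultiStreamBiLSTM(nn.Module):
    def __init__(self, num_classes: int = 5) -> None:
        super().__init__()
        self.pose = Stream(33, 128)   # output 256
        self.left = Stream(21, 64)    # output 128
        self.right = Stream(21, 64)   # output 128
        self.fusion = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, pose, left, right):
        fused = torch.cat([self.pose(pose), self.left(left), self.right(right)], dim=1)
        return self.classifier(self.fusion(fused))  # (batch, num_classes) logits
```

Calling the model with tensors of shape (batch, 30, 99), (batch, 30, 63), and (batch, 30, 63) returns one row of 5 logits per sample.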
The model is pretrained on the ASL Citizen dataset (100 ASL sign classes, 3,544 training samples), achieving 99.8% validation accuracy, then fine-tuned on a 5-class gesture dataset (663 training samples), achieving 95.9% test accuracy. That is 6.1 percentage points above training from scratch (89.8%).
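Moving from 100 pretraining classes to 5 gesture classes means the classification head cannot be reused, while the motion-encoding layers can. One common way to do this in PyTorch is sketched below, using a tiny stand-in network. `TinyGestureNet`, `load_pretrained_backbone`, and the `classifier.` key prefix are hypothetical; the key names in the real checkpoint may differ.

```python
import torch
import torch.nn as nn


class TinyGestureNet(nn.Module):
    """Stand-in for the real model: a recurrent backbone plus a linear head."""

    def __init__(self, num_classes: int) -> None:
        super().__init__()
        self.backbone = nn.LSTM(9, 16, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.backbone(x)
        return self.classifier(out.mean(dim=1))


def load_pretrained_backbone(model: nn.Module, pretrained_state: dict) -> list:
    """Copy every pretrained tensor except the head; return keys left untouched."""
    kept = {k: v for k, v in pretrained_state.items()
            if not k.startswith("classifier.")}
    result = model.load_state_dict(kept, strict=False)  # tolerate the missing head
    return list(result.missing_keys)


pretrained = TinyGestureNet(num_classes=100)  # pretend this was trained on ASL
finetune = TinyGestureNet(num_classes=5)      # fresh 5-class head
missing = load_pretrained_backbone(finetune, pretrained.state_dict())
```

After loading, `missing` lists only the head parameters, which keep their random initialization and are learned during fine-tuning.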
- Python 3.9+
- macOS with Homebrew
- A webcam
brew install sdl2 sdl2_image sdl2_mixer sdl2_ttf fluid-synth
git clone https://github.com/Andrew-Bonilla/conductor-simulator.git
cd conductor-simulator
python3 -m venv .venv
source .venv/bin/activate
pip install torch torchvision mediapipe opencv-python numpy \
    pretty_midi pygame pyfluidsynth
The SoundFont (orchestra.sf2) is the only file too large for GitHub (141 MB). Download it and place it in demo/:
curl -L -o demo/orchestra.sf2 "https://github.com/urish/cinto/raw/master/media/FluidR3%20GM.sf2"
Everything else (model weights, MediaPipe task files, and the MIDI file) is already included in the repo inside demo/.
cd demo
python3 conductor_simulator.py
Make sure your upper body and both hands are visible in the frame. Hold each gesture for about one second. Press Q to quit.
All training scripts are in scripts/. Run everything from the repo root.
Download the ASL Citizen dataset (pre-extracted MediaPipe keypoints) from Kaggle and place in data/asl_citizen/.
python scripts/multistream/train_multistream.py
Produces checkpoints/multistream/best_model.pt (99.8% validation accuracy).
Download the gesture recognition dataset from Kaggle and place raw videos in data/gesture_finetune_raw/.
python scripts/extract_keypoints_gesture_holistic.py
Produces data/gesture_keypoints_holistic/{train,val,test}/.
python scripts/multistream/finetune_gesture_holistic.py
Produces checkpoints/multistream/finetuned_gesture_holistic.pt (95.9% test accuracy).
python scripts/multistream/train_gesture_scratch.py
Trains from scratch without ASL pretraining and reaches 89.8% test accuracy, confirming the 6.1 percentage-point gain from transfer learning.
| Model | Approach | Test Accuracy |
|---|---|---|
| Multi-stream BiLSTM | ASL pretrain + fine-tune (Holistic) | 95.9% |
| Multi-stream BiLSTM | From scratch, no pretraining | 89.8% |
| Multi-stream BiLSTM | Frozen backbone, head-only fine-tune | 74.0% (val) |
| Multi-stream BiLSTM | ASL pretraining only | 99.8% (val, 100 classes) |
conductor-simulator/
├── demo/
│ ├── conductor_simulator.py # main real-time demo
│ ├── audio_engine.py # FluidSynth orchestra engine
│ ├── finetuned_gesture_holistic.pt # trained model weights
│ ├── pose_landmarker.task # MediaPipe pose model
│ ├── hand_landmarker.task # MediaPipe hand model
│ └── orchestra.mid # Beethoven 5th Symphony (public domain)
│ # orchestra.sf2 goes here too — download separately (see Installation)
├── scripts/
│ ├── multistream/
│ │ ├── model_multistream.py # BiLSTM architecture
│ │ ├── train_multistream.py # ASL pretraining
│ │ ├── finetune_gesture_holistic.py # gesture fine-tuning
│ │ ├── finetune_gesture_proper.py # proper split retraining
│ │ └── train_gesture_scratch.py # ablation baseline
│ └── extract_keypoints_gesture_holistic.py
└── docs/
  ├── ConductorSimulator_Writeup.pdf
  └── ConductorSimulator_Poster.pdf
A full technical writeup and conference-style poster are available in the docs/ folder, covering the architecture, training methodology, ablation study, and results in detail.
Columbia University — COMS BC3168 Deep Learning for Computer Graphics — Spring 2026
Andrew Bonilla
