A PyTorch implementation of the "Onsets and Frames" model for automatic piano transcription. Converts piano audio recordings into MIDI files using a combination of convolutional and recurrent neural networks.
- Transcription: Uses the "Onsets and Frames" neural network architecture
- Multiple output formats: Generate MIDI files or JSON data with note timings
- Command-line interface: Easy-to-use CLI for batch processing
- GPU acceleration: Supports CUDA for faster inference
- Flexible thresholds: Adjustable onset and frame detection sensitivity
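To make the onset/frame sensitivity idea concrete, here is an illustrative sketch of how threshold-based note decoding typically works in Onsets-and-Frames-style models. This is not the package's actual decoding code; the toy arrays and the `extract_notes` helper are hypothetical, shown only to explain what the two thresholds control:

```python
# Illustrative sketch of threshold-based note decoding -- NOT the package's
# actual implementation. Rows are time frames, columns are piano keys
# (3 toy keys here instead of 88).
onset_probs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.0],
    [0.1, 0.8, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
frame_probs = [
    [0.9, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.7, 0.9, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.2, 0.0],
    [0.0, 0.0, 0.0],
]

def extract_notes(onsets, frames, onset_threshold=0.5, frame_threshold=0.5):
    """Start a note when the onset probability crosses its threshold,
    then sustain it while the frame probability stays above its threshold."""
    notes = []
    n_frames, n_keys = len(onsets), len(onsets[0])
    for key in range(n_keys):
        t = 0
        while t < n_frames:
            if onsets[t][key] >= onset_threshold:
                start = t
                t += 1
                while t < n_frames and frames[t][key] >= frame_threshold:
                    t += 1
                notes.append((key, start, t))  # (key, onset frame, offset frame)
            else:
                t += 1
    return notes

notes = extract_notes(onset_probs, frame_probs)
```

Lowering the thresholds yields more (and longer) notes at the risk of false positives; raising them yields fewer, more confident notes.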
Install directly from GitHub:

```bash
pip install git+https://github.com/winc3/piano-transcriber.git
```

Or install from source:

```bash
git clone https://github.com/winc3/piano-transcriber.git
cd piano-transcriber
pip install -e .
```

Basic usage:
```bash
# Transcribe to MIDI
piano-transcriber input.wav -o output.mid

# Transcribe to JSON
piano-transcriber input.wav -f json -o output.json
```

Batch processing:
```bash
# Process multiple files
piano-transcriber *.wav -o /path/to/output/

# Process with custom sensitivity
piano-transcriber input.wav --onset-threshold 0.3 --frame-threshold 0.4
```

Advanced options:
```bash
# Use a specific model
piano-transcriber input.wav -m path/to/model.pth

# Force CPU usage
piano-transcriber input.wav --device cpu

# Verbose output
piano-transcriber input.wav -v
```

Python API:

```python
from piano_transcriber import PianoTranscriber

# Initialize transcriber
transcriber = PianoTranscriber("path/to/model.pth")

# Transcribe audio file
predictions = transcriber.transcribe_audio("input.wav")

# Convert to MIDI
midi = transcriber.predictions_to_midi(predictions, "output.mid")

# Convert to JSON
notes = transcriber.predictions_to_json(predictions)
```

Supported audio formats:

- WAV (.wav)
- MP3 (.mp3)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
The transcriber requires a trained model checkpoint. You can:
- Train your own model using the research components in this repository with your own data
- Use a pre-trained model from the sample checkpoints included in piano_transcriber/model/sample_checkpoints/ - ⚠️ Non-commercial use only
- Use the included model in piano_transcriber/model/ (if present) - ⚠️ Non-commercial use only
- Architecture: Onsets and Frames neural network with CNN feature extraction and bidirectional LSTM
- Input: 16kHz audio, 229 mel-frequency bins
- Output: 88 piano keys (A0-C8, MIDI notes 21-108)
- Inference: Supports variable-length audio with chunking and overlap handling
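As a quick reference for the output range above, the 88 model outputs correspond one-to-one to MIDI notes 21-108 (A0 up to C8). The helpers below are hypothetical, not part of the package API; they just illustrate that mapping:

```python
# Hypothetical helpers illustrating the key-index <-> MIDI mapping
# (not part of the piano_transcriber API).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def key_index_to_midi(key_index: int) -> int:
    """Convert a model output index (0..87) to a MIDI note number (21..108)."""
    if not 0 <= key_index <= 87:
        raise ValueError("piano key index must be in 0..87")
    return key_index + 21

def midi_to_name(midi_note: int) -> str:
    """Convert a MIDI note number to scientific pitch notation (MIDI 60 = C4)."""
    octave = midi_note // 12 - 1
    return f"{NOTE_NAMES[midi_note % 12]}{octave}"

print(midi_to_name(key_index_to_midi(0)))   # lowest key: A0
print(midi_to_name(key_index_to_midi(87)))  # highest key: C8
```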
Please feel free to submit issues, feature requests, or pull requests.
Code: MIT License - see LICENSE file for details. The source code can be used for any purpose, including commercial applications.
Pre-trained Models: Models trained on the MAESTRO dataset are restricted to non-commercial use only due to dataset licensing terms (CC BY-NC-SA 4.0). For commercial applications, you must train your own models using commercially-licensed data.
This implementation is based on the "Onsets and Frames" model for automatic music transcription. If you use this code in your research, please consider citing the original paper by its authors (of whom I am not one).
Reference: Onsets and Frames: Dual-Objective Piano Transcription
- Built with PyTorch and torchaudio
- Uses the MAESTRO dataset for training
- Inspired by Google's Onsets and Frames implementation