Conformer with multi-scale local attention and periodic positional encoding\* for composer classification. See coma-gen for a similar architecture used for music generation.
Model Architecture (see src/transformer.py):

- Embedding: REMI token embedding + scaled sinusoidal positional embedding.
- Encoder: Stack of conformer-like blocks[^1] (FeedForward → Multi-Scale Local Attention → Convolution Module → FeedForward, with LayerNorm and residuals).
  - Attention: Multi-scale local self-attention (windowed, not full sequence). Scales are aggregated via a weighted sum, with a learnable weight for each scale, inspired by the multi-scale attention mechanism in Cui et al.[^2] See the sketch after this list.
  - Convolution Module: pointwise convolution (with expansion factor of 2) → GLU activation → 1D depthwise convolution → BatchNorm → Swish activation.
- Sequence Attention: After encoding, a linear layer computes attention weights over the sequence, producing a weighted sum (the sequence embedding); see the pooling sketch below.
- Classifier: MLP (LayerNorm → Linear → GELU → Dropout → Linear) to output logits for composer classes.

\*periodic positional encoding[^2].
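To make the scale aggregation concrete, here is a minimal PyTorch sketch. It approximates each scale's windowed attention with a banded mask over `nn.MultiheadAttention` (the repo builds on lucidrains' local-attention instead), and the window sizes, head count, and weighting are illustrative assumptions, not the values in src/transformer.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleLocalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, window_sizes=(16, 32, 64)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        # one learnable weight per scale, softmax-normalized before summing
        self.scale_logits = nn.Parameter(torch.zeros(len(window_sizes)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        outs = []
        for attn, w in zip(self.attns, self.window_sizes):
            # banded mask: True entries are blocked, so each token only
            # attends to neighbors within +/- w positions
            mask = (idx[None, :] - idx[:, None]).abs() > w
            out, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
            outs.append(out)
        weights = F.softmax(self.scale_logits, dim=0)
        return sum(wgt * o for wgt, o in zip(weights, outs))
```

In the conformer block this module sits between the two feed-forward layers, alongside the convolution module.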
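Likewise, a minimal sketch of the sequence-attention pooling and the classifier head described above; the dimensions and dropout rate are placeholders, not the repo's settings:

```python
import torch
import torch.nn as nn

class SequenceAttentionPooling(nn.Module):
    """A linear layer scores each position; softmax weights give a weighted sum."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> (batch, dim) sequence embedding
        weights = self.score(x).softmax(dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)

# Classifier head: LayerNorm -> Linear -> GELU -> Dropout -> Linear
num_classes = 5  # e.g. TOP_K_COMPOSERS
classifier = nn.Sequential(
    nn.LayerNorm(256),
    nn.Linear(256, 128),
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(128, num_classes),
)
```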
Create a conda environment with python 3.11:

```bash
conda create -n coma python=3.11
conda activate coma
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Download the MAESTRO v3.0.0 dataset[^3]:

```bash
wget https://storage.googleapis.com/magentadata/datasets/maestro/v3.0.0/maestro-v3.0.0-midi.zip
unzip 'maestro-v3.0.0-midi.zip'
rm 'maestro-v3.0.0-midi.zip'
mv 'maestro-v3.0.0' 'data/maestro-v3.0.0'
```

Data Split & Preprocessing:
There are various options for data preparation and splitting:

- Tokenizer: Uses the miditok REMI tokenizer, either loaded, used untrained, or trained from scratch on the training set to a target vocab size. A usage sketch follows this list.
- Select Composers: Only the top K composers (by number of compositions or total duration) are selected (`TOP_K_COMPOSERS` in config).
- Train/Test Splits: For each composer, compositions are split so that no composition appears in more than one split (ensures no data leakage).
- Shuffle (recommended): Optionally shuffles before splitting (still maintaining that no composition appears in more than one split). This creates a stratified split based on `TEST_SIZE` in config. If `SHUFFLE=False`, the data split provided with the MAESTRO dataset is used.
- Augmentation: Optionally applies pitch, velocity, and duration augmentations to the training data.
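A hedged sketch of the tokenizer options using miditok's public API (written against miditok 3.x; the exact calls and parameters in this repo's preprocessing may differ):

```python
from pathlib import Path
from miditok import REMI, TokenizerConfig

# Untrained REMI tokenizer with default parameters
tokenizer = REMI(TokenizerConfig())

# Or train the vocabulary (BPE) on the training files to a target size;
# the vocab size here is a placeholder, not this repo's setting
train_files = list(Path("data/maestro-v3.0.0").glob("**/*.midi"))
tokenizer.train(vocab_size=1000, files_paths=train_files)

# Tokenize one MIDI file into a REMI token sequence
tokens = tokenizer(train_files[0])
```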
Adjust training params in config.py.
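For illustration, here are the config names referenced in this README with placeholder values; these values are assumptions, not the repo's defaults:

```python
# config.py (illustrative values only; names taken from this README)
TOP_K_COMPOSERS = 5   # keep only the top-K composers
TEST_SIZE = 0.2       # fraction held out when SHUFFLE=True (80:20 split)
SHUFFLE = True        # stratified shuffle; False uses MAESTRO's own split
LOG_DIR = "logs/"     # where TensorBoard logs and eval plots are written
```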
Begin training the transformer with:

```bash
python3 train.py
```

TensorBoard logs and eval plots will be saved in the specified LOG_DIR directory. View the logs with:

```bash
tensorboard --logdir=<LOG_DIR>
```

Training Details:
- Loss: Cross-entropy loss for multi-class classification.
- Optimizer: AdamW.
- LR Scheduler: MultiStepLR or CosineAnnealing.
- Metrics: Tracks accuracy and F1-score, both at the chunk and the composition level (majority voting or confidence aggregation; see the sketch below).
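For clarity, a sketch of the two composition-level aggregation schemes, assuming `chunk_probs` holds the softmax outputs for all chunks of a single composition (function names are hypothetical, not the repo's):

```python
import numpy as np

def majority_vote(chunk_probs: np.ndarray) -> int:
    """Each chunk votes with its argmax; the most common class wins."""
    votes = chunk_probs.argmax(axis=1)  # per-chunk predicted class
    return int(np.bincount(votes, minlength=chunk_probs.shape[1]).argmax())

def confidence_aggregation(chunk_probs: np.ndarray) -> int:
    """Average the chunk softmax distributions, then take the argmax."""
    return int(chunk_probs.mean(axis=0).argmax())
```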
Preliminary results (top K by number of compositions, 80:20 shuffled split, 20 epochs):
| # composers | Composition F1 (Confidence Agg) | Composition F1 (Majority Vote) | Chunk F1 | # params |
|---|---|---|---|---|
| 3 (config) | 0.98 | 0.98 | 0.84 | 406,948 |
| 5 (config) | 0.97 | 0.97 | 0.86 | 402,921 |
| 10 (config) | 0.90 | 0.89 | 0.69 | 407,822 |
| 13 (config) | 0.87 | 0.82 | 0.68 | 408,689 |
Related Work:

- Deep Composer Classification Using Symbolic Representation (2020) (code)
- Visual-based Musical Data Representation for Composer Classification (2022)
- ComposeInStyle: Music composition with and without Style Transfer (2021)
- Composer Classification with Cross-modal Transfer Learning and Musically-informed Augmentation (2021) (zero-shot)
- Automated Thematic Composer Classification Using Segment Retrieval (2024)
- Concept-Based Explanations For Composer Classification (2022) (code)
The following work achieves perfect accuracy/F1. Looking at their code, there appears to be data leakage between the train and test sets: their dataset (on which they perform a random train/test split) for the 5-composer classification task has at most 482 unique compositions but 809 total recordings, so recordings of the same composition can land in both splits.
NLP-based music processing for composer classification (2023)
This repo is largely adapted from the following:

- local attention: https://github.com/lucidrains/local-attention
- conformer: https://github.com/jreremy/conformer, https://github.com/lucidrains/conformer
- miditok: https://github.com/Natooz/MidiTok
[^1]: Gulati, A., Qin, J., Chiu, C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., & Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. ArXiv, abs/2005.08100.
[^2]: Cui, X. H., Hu, P., & Huang, Z. (2025). Music sequence generation and arrangement based on transformer model. Journal of Computational Methods in Sciences and Engineering. doi:10.1177/14727978251337904.
[^3]: Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C. A., Dieleman, S., Elsen, E., Engel, J., & Eck, D. (2018). Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. ArXiv, abs/1810.12247.
