# Emonity

This project implements a robust Speech Emotion Recognition (SER) system using Mel-Frequency Cepstral Coefficients (MFCC) and deep learning models in PyTorch. It combines multiple public emotional speech datasets, extensive data augmentation, and ensemble learning to achieve high accuracy and real-time inference.
## Table of Contents

- Overview
- Features
- Datasets
- Installation
- Usage
- Model Architectures
- Results
- Troubleshooting
- License
## Overview

This repository provides a complete pipeline for recognizing emotions from speech audio using MFCC and other audio features, with models built in PyTorch. The system supports:
- Data loading and preprocessing from several popular datasets
- Advanced feature extraction (MFCC, log-mel, spectral, chroma, ZCR, RMS)
- Data augmentation (noise, pitch, speed, etc.; see the sketch after this list)
- Training of multiple deep learning models (1D CNN, 2D CNN, CNN-BiLSTM)
- Ensemble learning for improved accuracy
- Real-time inference pipeline
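The augmentation step listed above can be done with NumPy and librosa. A minimal sketch, not the notebook's exact code; the noise scale, semitone shift, and stretch rate are illustrative:

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return simple augmented variants of a waveform y sampled at sr."""
    noisy = y + 0.005 * np.random.randn(len(y))                 # additive white noise
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up two semitones
    faster = librosa.effects.time_stretch(y, rate=1.1)          # 10% speed-up
    return [noisy, pitched, faster]
```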
## Features

- Multi-dataset support: CREMA-D, RAVDESS, TESS, SAVEE
- Advanced feature extraction: MFCCs (with deltas), log-mel spectrograms, spectral/chroma features (see the sketch after this list)
- Data augmentation: Noise injection, pitch shift, speed change, etc.
- Deep learning models:
- 1D CNN with self-attention
- 2D CNN (ResNet-like)
- CNN-BiLSTM with attention
- Ensemble model: Combines all models for best performance
- Real-time inference: Fast, low-latency prediction pipeline
- Comprehensive evaluation: Accuracy, precision, recall, F1-score, confusion matrix
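A minimal sketch of how these features can be extracted with librosa; parameter choices such as `n_mfcc=40` and `n_mels=128` are illustrative, not necessarily the notebook's settings:

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)              # first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)     # second-order deltas
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)
    # Frame-level feature stack for 1D models; log_mel doubles as a 2D input.
    return np.vstack([mfcc, d1, d2, chroma, zcr, rms]), log_mel
```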
## Datasets

The following datasets are used (see the `dataset/` directory): CREMA-D, RAVDESS, SAVEE, and TESS.
Dataset structure example:

```
dataset/
  cremad/AudioWAV/
  ravdess-emotional-speech-audio/audio_speech_actors_01-24/
  surrey-audiovisual-expressed-emotion-savee/ALL/
  toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data/
```
Note: Download and extract datasets as per their respective licenses. The notebook expects the above structure.
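Before running the notebook, it can help to sanity-check the layout. A small helper, not part of the project code; the paths are copied from the structure above:

```python
from pathlib import Path

EXPECTED = [
    "dataset/cremad/AudioWAV",
    "dataset/ravdess-emotional-speech-audio/audio_speech_actors_01-24",
    "dataset/surrey-audiovisual-expressed-emotion-savee/ALL",
    "dataset/toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data",
]

for d in EXPECTED:
    p = Path(d)
    n = len(list(p.rglob("*.wav"))) if p.is_dir() else 0
    print(f"{'OK     ' if n else 'MISSING'} {d} ({n} .wav files)")
```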
## Installation

- Clone the repository:

```bash
git clone <repo-url>
cd emonity
```

- Install dependencies (Python 3.8+ recommended):

```bash
pip install -r requirements.txt
```

Or, install manually:

```bash
pip install pandas numpy librosa seaborn matplotlib scikit-learn ipython torch torchaudio xgboost lightgbm scikit-image torchvision
```

For CUDA support, install PyTorch to match your GPU, for example:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
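To confirm the GPU build is active, a quick check:

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```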
`requirements.txt` (recommended content):

```
pandas
numpy
librosa
seaborn
matplotlib
scikit-learn
ipython
torch
torchaudio
xgboost
lightgbm
scikit-image
torchvision
```
## Usage

- Place all datasets in the `dataset/` directory as structured above.
- The notebook (`Emonity.ipynb`) will automatically process and combine the datasets, extract features, and perform augmentation.
- Open and run `Emonity.ipynb` in Jupyter or VS Code.
- The notebook will:
  - Load and preprocess the data
  - Extract features and augment the data
  - Split the data into train/validation/test sets (see the sketch below)
  - Train three models: 1D CNN, 2D CNN, and CNN-BiLSTM
  - Save the trained models as `.pth` files
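The split step can be done with scikit-learn. A sketch assuming a feature array `X` and label array `y`; the 80/10/10 ratio is illustrative:

```python
from sklearn.model_selection import train_test_split

# Hold out 20%, then split the holdout evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```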
Training example:
```python
# In Emonity.ipynb
model_1d_cnn, history_1d_cnn = train_model_advanced(
    model_1d_cnn, train_loader_1d, val_loader_1d,
    num_epochs=150, learning_rate=0.0005, weight_decay=0.01
)
torch.save(model_1d_cnn.state_dict(), 'enhanced_1d_cnn_model.pth')
```
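To reuse a saved checkpoint later, instantiate the model class and load the weights. A sketch; `Enhanced1DCNN` is a hypothetical name, so use the class actually defined in the notebook:

```python
import torch

model = Enhanced1DCNN(num_classes=7)  # hypothetical class; defined in Emonity.ipynb
model.load_state_dict(torch.load('enhanced_1d_cnn_model.pth', map_location='cpu'))
model.eval()  # disable dropout and batch-norm updates for inference
```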
The notebook evaluates each model and the ensemble on the test set, reporting accuracy, precision, recall, F1-score, and confusion matrices.
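These metrics come from scikit-learn; a minimal sketch assuming test labels `y_true` and predictions `y_pred` (whether the notebook uses these exact calls is an assumption):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=4))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))
```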
See the Results section below for the summary table and per-class metrics.
### Real-Time Inference

The notebook provides a `RealTimeEmotionClassifier` class for low-latency emotion prediction from audio files.

Example usage:

```python
from SpeechEr import RealTimeEmotionClassifier

classifier = RealTimeEmotionClassifier('speech_emotion_ensemble_model.pth')
emotion, confidence, probabilities, inference_time = classifier.predict_emotion('path/to/audio.wav')
print(f"Predicted: {emotion} (confidence: {confidence:.2f}, time: {inference_time:.3f}s)")
```

## Model Architectures

- Enhanced 1D CNN: Self-attention, batch norm, dropout, global pooling
- Enhanced 2D CNN: ResNet-like blocks, batch norm, dropout, global pooling
- Enhanced CNN-BiLSTM: CNN feature extractor, BiLSTM, attention, dense layers
- Ensemble: Weighted average of all models' predictions (see the sketch after the results table below)
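For orientation, a minimal sketch of a 1D CNN with self-attention over time steps, in the spirit of the first architecture; channel counts, kernel sizes, and head count are illustrative, not the notebook's exact configuration:

```python
import torch
import torch.nn as nn

class Tiny1DCNNAttention(nn.Module):
    """Illustrative 1D CNN + self-attention; not the notebook's exact model."""
    def __init__(self, in_features=40, num_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_features, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):            # x: (batch, features, time)
        h = self.conv(x)             # (batch, 128, time)
        h = h.transpose(1, 2)        # (batch, time, 128) for attention
        h, _ = self.attn(h, h, h)    # self-attention across time steps
        return self.head(h.mean(dim=1))  # global average pool, then classify
```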
## Results

Overall test-set performance:

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| 1D CNN | 0.705 | 0.712 | 0.705 | 0.698 |
| 2D CNN | 0.901 | 0.903 | 0.901 | 0.901 |
| CNN-BiLSTM | 0.794 | 0.796 | 0.794 | 0.793 |
| Ensemble | 0.885 | 0.888 | 0.885 | 0.885 |
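The ensemble row is a weighted average of the three models' class probabilities. A minimal sketch; the equal weights are illustrative (the actual weights may be tuned in the notebook):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, inputs, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-model softmax outputs; inputs differ per model."""
    probs = 0.0
    for model, x, w in zip(models, inputs, weights):
        model.eval()
        with torch.no_grad():
            probs = probs + w * F.softmax(model(x), dim=1)
    return probs.argmax(dim=1), probs
```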
**1D CNN (per-class):**

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| angry | 0.7301 | 0.8567 | 0.7883 |
| disgust | 0.6941 | 0.6317 | 0.6614 |
| fear | 0.7878 | 0.4950 | 0.6080 |
| happy | 0.7238 | 0.5633 | 0.6336 |
| neutral | 0.6381 | 0.8433 | 0.7265 |
| sad | 0.6294 | 0.6850 | 0.6560 |
| surprise | 0.8186 | 0.9439 | 0.8768 |
**2D CNN (per-class):**

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| angry | 0.9401 | 0.9150 | 0.9274 |
| disgust | 0.8887 | 0.8783 | 0.8835 |
| fear | 0.9325 | 0.8517 | 0.8902 |
| happy | 0.9201 | 0.8833 | 0.9014 |
| neutral | 0.9012 | 0.9117 | 0.9064 |
| sad | 0.8174 | 0.9100 | 0.8612 |
| surprise | 0.9303 | 0.9872 | 0.9579 |
**CNN-BiLSTM (per-class):**

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| angry | 0.8495 | 0.8750 | 0.8621 |
| disgust | 0.8081 | 0.7300 | 0.7671 |
| fear | 0.7797 | 0.6783 | 0.7255 |
| happy | 0.8027 | 0.6917 | 0.7431 |
| neutral | 0.7576 | 0.8750 | 0.8121 |
| sad | 0.7038 | 0.8000 | 0.7488 |
| surprise | 0.9115 | 0.9719 | 0.9407 |
**Ensemble (overall):**

| Metric | Value |
|---|---|
| Accuracy | 0.8850 |
| Precision | 0.8875 |
| Recall | 0.8850 |
| F1-Score | 0.8845 |
See the notebook for confusion matrices and further analysis.
## Troubleshooting

- CUDA not available: Ensure you have a compatible GPU and the matching CUDA-enabled PyTorch build installed (see the check in Installation).
- Dataset not found: Double-check dataset paths and structure.
- Out of memory: Reduce batch size or use a machine with more RAM/GPU memory.
- Librosa or torchaudio errors: Ensure all dependencies are installed and up to date.
## License

This project is for academic and research purposes. Please check the individual dataset licenses for usage restrictions.
