arslanmit/realworld-nlp-techniques

Real-world NLP: COVID-19 Tweet Emotion Classification

Overview

This project analyzes ~3,500 English tweets from the early COVID-19 pandemic and classifies each into one of 10 emotion categories. It uses the SenWave COVID-19 sentiment dataset and compares two modeling approaches:

  • A baseline TF-IDF + Logistic Regression classifier
  • A fine-tuned BERT transformer model

The pipeline includes data cleaning and exploration, model training, evaluation, and result visualization.

Dataset

  • Dataset: SenWave COVID-19 Sentiment dataset of ~3,500 English tweets (each labeled with one dominant emotion).
  • Emotion Labels (10): Optimistic, Thankful, Empathetic, Pessimistic, Anxious, Sad, Annoyed, Denial, Official report, Joking.

Key Features

Data Processing

  • Text Cleaning: Standard text preprocessing.
  • Class Imbalance Handling: Tomek Links under-sampling.
  • Train/Test Split: 80/20 stratified split.
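
A stratified 80/20 split keeps each emotion's share roughly equal in the training and test partitions. The sketch below shows the idea in plain Python, assuming rows are (text, label) pairs. The function name `stratified_split`, the seed, and the toy data are illustrative and not taken from the project, which may use a library routine instead.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (text, label) rows so each label keeps about the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):  # sorted so the result is reproducible
        group = by_label[label]
        rng.shuffle(group)
        # Assumption: every class contributes at least one test example.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical toy data: 10 "Sad" tweets and 5 "Joking" tweets.
rows = [(f"tweet {i}", "Sad") for i in range(10)]
rows += [(f"tweet {i}", "Joking") for i in range(10, 15)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Note that with this rule a class with a single example ends up only in the test set, so very rare classes may need special handling.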

Models

  1. Baseline (TF-IDF + Logistic Regression) – Simple baseline model with tuned hyperparameters; fast to train and infer.
  2. BERT (Fine-tuned) – Fine-tunes a base uncased BERT model with a custom training loop (early stopping) and class weighting for imbalanced data.
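
The class weighting and early stopping mentioned for the BERT model can be illustrated without any deep learning framework. The sketch below assumes inverse-frequency ("balanced") weights and a patience-based stopping rule on validation loss. The README does not specify either scheme, so `balanced_class_weights`, `EarlyStopping`, and the numbers used are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Rare classes receive larger weights than frequent ones.
weights = balanced_class_weights(["Sad"] * 6 + ["Denial"] * 2)

# Made-up validation losses: no improvement at epochs 2 and 3 triggers the stop.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.6]):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In a real training loop the weights would typically be passed to the loss function, and the model checkpoint with the best validation loss would be restored after stopping.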

Evaluation Metrics

  • Accuracy
  • Macro/Micro F1
  • Precision & Recall
  • ROC–AUC (ROC curves and AUC)
  • Confusion matrix
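
Macro and micro F1 can differ sharply on imbalanced data: macro F1 averages per-class scores equally, while micro F1 pools all decisions and, for single-label classification, equals accuracy. The sketch below computes both in plain Python. The function `f1_scores` and the toy labels are illustrative only and are not the project's evaluation code.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class_f1 = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_total += tp
        fp_total += fp
        fn_total += fn
    macro = sum(per_class_f1) / len(per_class_f1)
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    return macro, micro

# Toy example: a classifier that always predicts the majority class.
y_true = ["Sad", "Sad", "Sad", "Sad", "Joking"]
y_pred = ["Sad", "Sad", "Sad", "Sad", "Sad"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

Here the rare class is never predicted, so its F1 is 0 and pulls the macro average down even though four of five predictions are correct.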

Visualization

  • Class distribution plots
  • Training curves
  • Model comparison charts
  • Sentiment analysis visualizations
  • Interactive HTML reports

Setup

Prerequisites: Python 3.8+ and pip

Installation:

  1. Clone the repository and navigate into it:
    git clone <repository-url>
    cd realworld-nlp-techniques
  2. Create and activate a virtual environment:
    python3 -m venv .venv
    source .venv/bin/activate  # Windows: .venv\Scripts\activate
  3. Install required packages:
    pip install -r requirements.txt
  4. Place the dataset CSV file at Dataset/SenWave/SenWave_single_emotion_labeled.csv (ensure it has columns Tweet and Emotion).
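
Before running the pipeline, it can help to confirm that the CSV is in place and has the expected header. The following standalone check is a sketch, not part of the repository. The helper name `missing_columns` is hypothetical, and it assumes the file is a comma-separated UTF-8 CSV with a header row.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"Tweet", "Emotion"}

def missing_columns(csv_path):
    """Return the required columns absent from the CSV header (empty set if all present)."""
    # "utf-8-sig" also accepts files that start with a byte-order mark.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    return REQUIRED_COLUMNS - {name.strip() for name in header}

path = Path("Dataset/SenWave/SenWave_single_emotion_labeled.csv")
if path.exists():
    print("Missing columns:", sorted(missing_columns(path)) or "none")
else:
    print(f"Dataset not found: {path}")
```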

Usage

Full Pipeline

Run the entire pipeline (EDA, training both models, evaluation, and visualizations):

python main.py

Quick Test (Skip BERT Fine-tuning)

For a faster run without BERT fine-tuning:

QUICK_TEST=1 python main.py
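
The command above sets an environment variable for a single run using POSIX shell syntax. How main.py reads the flag is not shown in this README. A script could interpret it roughly as follows, where `quick_test_enabled` and the accepted values are assumptions for illustration.

```python
import os

def quick_test_enabled(environ=os.environ):
    """Treat QUICK_TEST=1 (or true/yes) as enabled; anything else as disabled."""
    return environ.get("QUICK_TEST", "").strip().lower() in {"1", "true", "yes"}

if quick_test_enabled():
    print("Quick test: skipping BERT fine-tuning")
else:
    print("Full run: training baseline and BERT")
```

On Windows PowerShell, the variable has to be set separately before the command, for example `$env:QUICK_TEST = "1"` followed by `python main.py`.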

Output

All results are saved in a timestamped folder, output/<timestamp>/, with the following subfolders:

  • plots/: generated charts and figures
  • tables/: metrics and predictions
  • models/: saved model checkpoints
  • logs/: execution logs
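
A layout like this can be created with pathlib, as sketched below. The helper `make_run_dir` is hypothetical, and the timestamp format `%Y%m%d_%H%M%S` is an assumption, since the README does not state how the folder name is formatted.

```python
from datetime import datetime
from pathlib import Path

SUBFOLDERS = ("plots", "tables", "models", "logs")

def make_run_dir(base="output", now=None):
    """Create base/<timestamp>/ with the four subfolders and return its path."""
    # Assumed timestamp format; the project's actual format may differ.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base) / stamp
    for name in SUBFOLDERS:
        (run_dir / name).mkdir(parents=True, exist_ok=True)
    return run_dir
```

Calling `make_run_dir()` once at the start of a run and passing the returned path to each stage keeps all artifacts of that run together.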

Results

Model Performance

Model                 Accuracy   Macro F1   Training Time
Logistic Regression   55.2%      0.44       ~1 min
BERT                  68.3%      0.48       ~30 min

Performance Notes

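  • BERT is more accurate overall (68.3% vs. 55.2%), but its macro F1 gain is smaller (0.48 vs. 0.44), which suggests the improvement is uneven across emotion classes.
  • The baseline trains in about 1 minute, compared with about 30 minutes for BERT, which makes it a practical choice when speed matters.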
  • Certain emotion categories remain challenging due to class imbalance.

Customization

Configuration: Adjust the Config class in main.py to modify training parameters, model settings, output paths, or early stopping criteria.
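
The fields of the Config class are not listed in this README. As an illustration only, a configuration object covering the kinds of settings mentioned above might look like the following dataclass. The class name `ExampleConfig` and every field name and default value are assumptions, not the project's actual values.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExampleConfig:
    # All field names and defaults here are illustrative assumptions.
    data_path: str = "Dataset/SenWave/SenWave_single_emotion_labeled.csv"
    output_dir: str = "output"
    test_size: float = 0.2            # corresponds to the 80/20 split
    batch_size: int = 16
    learning_rate: float = 2e-5
    max_epochs: int = 5
    early_stopping_patience: int = 2

default = ExampleConfig()
# Derive a variant without mutating the original, e.g. for a GPU with less memory.
small_gpu = replace(default, batch_size=8)
print(small_gpu)
```

Keeping the configuration immutable and deriving variants with `replace` makes it easy to log exactly which settings produced a given run.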

Adding New Models:

  1. Create a new model class (implement training and prediction methods).
  2. Register the new model in the evaluation pipeline.
  3. Add any necessary evaluation metrics.
  4. Update comparison and visualization scripts accordingly.
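
The first step above can be sketched with a small interface and a trivial model. The names `EmotionModel`, `MajorityClassModel`, and `accuracy` are hypothetical. The project's real base class and method names, if any, may differ.

```python
from abc import ABC, abstractmethod
from collections import Counter

class EmotionModel(ABC):
    """Hypothetical interface with the training and prediction methods a model needs."""

    @abstractmethod
    def fit(self, texts, labels):
        ...

    @abstractmethod
    def predict(self, texts):
        ...

class MajorityClassModel(EmotionModel):
    """Always predicts the most frequent training label; a sanity-check floor."""

    def fit(self, texts, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.majority_] * len(texts)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

model = MajorityClassModel().fit(["a", "b", "c"], ["Sad", "Sad", "Joking"])
preds = model.predict(["x", "y"])
print(preds, accuracy(["Sad", "Joking"], preds))
```

A majority-class model like this is also a useful reference point in the comparison charts, since any real model should beat it.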

Troubleshooting

  1. Missing dependencies: Ensure all packages are installed (pip install -r requirements.txt).
  2. CUDA out of memory: Reduce the batch size or run on CPU (set CUDA_VISIBLE_DEVICES="" to disable GPU).
  3. Dataset not found: Verify the CSV is placed at Dataset/SenWave/SenWave_single_emotion_labeled.csv (with correct file name and permissions).

Citation and References

If you use the SenWave dataset, please cite the original paper:

@article{yang2020senwave,
  title={SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic},
  author={Yang, Qiang and Alamro, Hind and Albaradei, Somayah and Salhi, Adil and others},
  journal={arXiv preprint arXiv:2006.02158},
  year={2020}
}

For more details on the dataset, see the SenWave repository.

License

Licensed under the MIT License. See the LICENSE file for details.

About

Applied NLP techniques on real-world datasets using Python & Hugging Face – MSc AI NLP project.
