Real-world NLP: COVID-19 Tweet Emotion Classification

Overview

This project analyzes ~3,500 English tweets from the early COVID-19 pandemic and classifies each into one of 10 emotion categories. It uses the SenWave COVID-19 sentiment dataset and compares two modeling approaches:

A baseline TF-IDF + Logistic Regression classifier
A fine-tuned BERT transformer model

The pipeline includes data cleaning and exploration, model training, evaluation, and result visualization.

Dataset

Dataset: SenWave COVID-19 Sentiment dataset of ~3,500 English tweets (each labeled with one dominant emotion).
Emotion Labels (10): Optimistic, Thankful, Empathetic, Pessimistic, Anxious, Sad, Annoyed, Denial, Official report, Joking.

Key Features

Data Processing

Text Cleaning: Standard text preprocessing.
Class Imbalance Handling: Tomek Links under-sampling.
Train/Test Split: 80/20 stratified split.
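
The README does not spell out what "standard text preprocessing" covers, so the following is only a minimal sketch of typical tweet cleaning and an 80/20 stratified split, using the standard library. The function names `clean_tweet` and `stratified_split`, the specific cleaning rules, and the seed are assumptions for illustration, not the project's actual code.

```python
import random
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and extra whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with about test_fraction of each class in test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting per class keeps rare emotions represented in both the training and test sets, which a plain random split does not guarantee.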

Models

Baseline (TF-IDF + Logistic Regression) – A linear classifier on TF-IDF features with tuned hyperparameters; fast to train and to run.
BERT (Fine-tuned) – Fine-tunes an uncased BERT base model using a custom training loop with early stopping, plus class weighting to counter class imbalance.
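
Class weighting and early stopping are framework-independent ideas, so they can be sketched without any deep learning library. The version below assumes inverse-frequency ("balanced") weights and patience-based stopping on validation loss. The names `balanced_class_weights` and `EarlyStopping` and the default patience are hypothetical, and the project's own training loop may differ.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


class EarlyStopping:
    """Signal a stop after `patience` epochs without validation-loss improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage with made-up validation losses: training stops at the fourth epoch.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate([0.9, 0.7, 0.72, 0.71]):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
```

Rare classes receive weights above 1 and frequent classes weights below 1, so misclassifying a rare emotion costs more in the loss.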

Evaluation Metrics

Accuracy
Macro/Micro F1
Precision & Recall
ROC–AUC (ROC curves and AUC)
Confusion matrix
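
Macro and micro F1 can diverge sharply on imbalanced data, which is why both are reported. The sketch below computes them by hand for single-label predictions. The function name `f1_scores` and the toy labels are illustrative only, and the project may well rely on a library implementation instead.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0

    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn

    macro = sum(per_class_f1) / len(per_class_f1)  # every class counts equally
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0  # every sample counts equally
    return macro, micro


# A model that always predicts the majority class on toy data.
y_true = ["sad"] * 8 + ["joking"] * 2
y_pred = ["sad"] * 10
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

Here micro F1 is 0.8 (equal to accuracy in the single-label setting), while macro F1 drops to about 0.44 because the missed minority class contributes an F1 of zero.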

Visualization

Class distribution plots
Training curves
Model comparison charts
Sentiment analysis visualizations
Interactive HTML reports

Setup

Prerequisites: Python 3.8+ and pip

Installation:

Clone the repository and navigate into it:

```
git clone <repository-url>
cd realworld-nlp-techniques
```

Create and activate a virtual environment:

```
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
```

Install required packages:
```
pip install -r requirements.txt
```
Place the dataset CSV file at Dataset/SenWave/SenWave_single_emotion_labeled.csv (ensure it has columns Tweet and Emotion).
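
Before a long run, it can help to confirm that the CSV is in place and has the expected columns. The check below is a small sketch using only the standard library. The helper name `summarize_dataset` and the UTF-8 encoding are assumptions, and it is not part of the project's pipeline.

```python
import csv
from collections import Counter
from pathlib import Path

DATASET_PATH = Path("Dataset/SenWave/SenWave_single_emotion_labeled.csv")
REQUIRED_COLUMNS = {"Tweet", "Emotion"}


def summarize_dataset(path=DATASET_PATH):
    """Check the required columns exist and return a Counter of emotion labels."""
    # UTF-8 is assumed here; adjust the encoding if the file was exported differently.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path} is missing columns: {sorted(missing)}")
        return Counter(row["Emotion"] for row in reader)
```

Calling `summarize_dataset()` from the repository root returns the number of tweets per emotion, which also gives a quick view of the class imbalance.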

Usage

Full Pipeline

Run the entire pipeline (EDA, training both models, evaluation, and visualizations):

```
python main.py
```

Quick Test (Skip BERT Fine-tuning)

For a faster run without BERT fine-tuning:

```
QUICK_TEST=1 python main.py
```
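
How main.py interprets QUICK_TEST is not documented here. As an illustration only, an environment flag like this is commonly read as follows. The helper name `quick_test_enabled` and the accepted values are assumptions.

```python
import os


def quick_test_enabled(environ=None):
    """Return True when the QUICK_TEST variable is set to a truthy value."""
    environ = os.environ if environ is None else environ
    return environ.get("QUICK_TEST", "").strip().lower() in {"1", "true", "yes"}


# A pipeline could branch on the flag to skip the slow fine-tuning stage.
stages = ["eda", "baseline"] if quick_test_enabled() else ["eda", "baseline", "bert"]
```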

Output

All results are saved in a timestamped folder, output/<timestamp>/, with the following subfolders:

plots/: generated charts and figures
tables/: metrics and predictions
models/: saved model checkpoints
logs/: execution logs
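
The layout above can be reproduced with a few lines of standard-library code. This is a sketch under assumptions: the helper name `make_run_dir` and the `YYYYMMDD_HHMMSS` timestamp format are hypothetical, and the project may name its run folders differently.

```python
from datetime import datetime
from pathlib import Path

SUBFOLDERS = ("plots", "tables", "models", "logs")


def make_run_dir(root="output", now=None):
    """Create <root>/<timestamp>/ with the four subfolders and return its path."""
    now = now or datetime.now()
    run_dir = Path(root) / now.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    for name in SUBFOLDERS:
        (run_dir / name).mkdir(parents=True, exist_ok=True)
    return run_dir
```

Writing each run to its own folder keeps earlier results intact, so runs can be compared side by side.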

Results

Model Performance

| Model | Accuracy | Macro F1 | Training Time |
|---|---|---|---|
| TF-IDF + Logistic Regression | 55.2% | 0.44 | ~1 min |
| BERT (fine-tuned) | 68.3% | 0.48 | ~30 min |

Performance Notes

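BERT performs better overall: accuracy rises from 55.2% to 68.3% and macro F1 from 0.44 to 0.48. The smaller macro F1 gain suggests that most of the improvement comes from the more frequent classes.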
The baseline offers a good balance between speed and performance.
Certain emotion categories remain challenging due to class imbalance.

Customization

Configuration: Adjust the Config class in main.py to modify training parameters, model settings, output paths, or early stopping criteria.
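
The actual fields of the Config class are defined in main.py and are not listed in this README. Purely as an illustration of the pattern, a configuration object of this kind might look like the sketch below. The class name `ExampleConfig` and every field name and default are hypothetical, apart from the dataset path and the 80/20 split stated above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExampleConfig:
    """Illustrative settings only; check main.py for the real Config fields."""

    dataset_path: str = "Dataset/SenWave/SenWave_single_emotion_labeled.csv"
    output_root: str = "output"
    test_size: float = 0.2  # 80/20 split, as described under Data Processing
    batch_size: int = 16  # hypothetical default
    learning_rate: float = 2e-5  # hypothetical default
    max_epochs: int = 5  # hypothetical default
    early_stopping_patience: int = 2  # hypothetical default


# Derive a variant for a GPU with less memory without mutating the defaults.
default_config = ExampleConfig()
small_gpu_config = replace(default_config, batch_size=8)
```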

Adding New Models:

Create a new model class (implement training and prediction methods).
Add the new model into the evaluation pipeline.
Add any necessary evaluation metrics.
Update comparison and visualization scripts accordingly.
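
The steps above can be sketched as a small interface plus one trivial model that satisfies it. The method names `fit` and `predict`, the `EmotionModel` protocol, and the `accuracy` helper are assumptions for illustration. The interface the pipeline in main.py expects may differ.

```python
from collections import Counter
from typing import List, Protocol, Sequence


class EmotionModel(Protocol):
    """Hypothetical interface: anything with fit() and predict() qualifies."""

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> None: ...

    def predict(self, texts: Sequence[str]) -> List[str]: ...


class MajorityClassModel:
    """Sanity-check model that always predicts the most frequent training label."""

    def __init__(self) -> None:
        self.majority_label = None

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> None:
        self.majority_label = Counter(labels).most_common(1)[0][0]

    def predict(self, texts: Sequence[str]) -> List[str]:
        if self.majority_label is None:
            raise RuntimeError("call fit() before predict()")
        return [self.majority_label] * len(texts)


def accuracy(model: EmotionModel, texts: Sequence[str], labels: Sequence[str]) -> float:
    """Share of texts for which the model predicts the reference label."""
    predictions = model.predict(texts)
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)
```

A majority-class model is a useful floor: any real model added to the comparison should beat it on macro F1, not only on accuracy.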

Troubleshooting

Missing dependencies: Ensure all packages are installed (pip install -r requirements.txt).
CUDA out of memory: Reduce the batch size or run on CPU (set CUDA_VISIBLE_DEVICES="" to disable GPU).
Dataset not found: Verify the CSV is placed at Dataset/SenWave/SenWave_single_emotion_labeled.csv (with correct file name and permissions).

Citation and References

If you use the SenWave dataset, please cite the original paper:

@article{yang2020senwave,
  title={SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic},
  author={Yang, Qiang and others},
  journal={arXiv preprint arXiv:2006.10842},
  year={2020}
}

For more details on the dataset, see the SenWave repository.

License

Licensed under the MIT License. See the LICENSE file for details.
