This project analyzes ~3,500 English tweets from the early COVID-19 pandemic and classifies each into one of 10 emotion categories. It uses the SenWave COVID-19 sentiment dataset and compares two modeling approaches:
- A baseline TF-IDF + Logistic Regression classifier
- A fine-tuned BERT transformer model
The pipeline includes data cleaning and exploration, model training, evaluation, and result visualization.
- Dataset: SenWave COVID-19 Sentiment dataset of ~3,500 English tweets (each labeled with one dominant emotion).
- Emotion Labels (10): Optimistic, Thankful, Empathetic, Pessimistic, Anxious, Sad, Annoyed, Denial, Official report, Joking.
- Text Cleaning: Standard text preprocessing.
- Class Imbalance Handling: Tomek Links under-sampling.
- Train/Test Split: 80/20 stratified split.
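As a rough sketch of the cleaning and stratified split steps above (the cleaning rules and the `clean_tweet` helper here are illustrative assumptions, not the project's actual preprocessing):

```python
import re
from sklearn.model_selection import train_test_split

def clean_tweet(text):
    """Minimal tweet cleaning: lowercase, strip URLs, @mentions, and non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)     # keep letters only
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# Toy stand-ins for the SenWave tweets, duplicated so each class has
# enough examples for a stratified split.
tweets = ["Stay safe everyone! https://t.co/abc @WHO", "So ANXIOUS about lockdown..."]
labels = ["Optimistic", "Anxious"]
cleaned = [clean_tweet(t) for t in tweets] * 5
y = labels * 5

# 80/20 stratified split preserves the label proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, y, test_size=0.2, stratify=y, random_state=42
)
```

Tomek Links under-sampling would then be applied to the vectorized training set only (e.g. via `imblearn.under_sampling.TomekLinks`), never to the test set.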
- Baseline (TF-IDF + Logistic Regression) – a linear classifier over TF-IDF features with tuned hyperparameters; fast to train and to run inference.
- BERT (Fine-tuned) – Fine-tunes a base uncased BERT model with a custom training loop (early stopping) and class weighting for imbalanced data.
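A minimal sketch of the TF-IDF + Logistic Regression baseline, using scikit-learn; the toy data and the specific n-gram/regularization settings are assumptions, not the project's tuned values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data standing in for the cleaned SenWave tweets.
texts = [
    "so thankful for the nurses",
    "this lockdown makes me anxious",
    "grateful for healthcare workers",
    "worried and anxious all day",
]
labels = ["Thankful", "Anxious", "Thankful", "Anxious"]

baseline = Pipeline([
    # Unigrams + bigrams are a common starting point; the tuned values may differ.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    # class_weight="balanced" is one way to counter label imbalance in a linear model.
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

baseline.fit(texts, labels)
print(baseline.predict(["feeling anxious about the news"]))
```

The same `fit`/`predict` interface makes it easy to swap in other scikit-learn classifiers for comparison.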
- Accuracy
- Macro/Micro F1
- Precision & Recall
- ROC–AUC (per-class ROC curves and AUC scores)
- Confusion matrix
- Class distribution plots
- Training curves
- Model comparison charts
- Sentiment analysis visualizations
- Interactive HTML reports
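The core metrics above can be computed with `sklearn.metrics`; a small sketch on hypothetical gold labels and predictions (ROC–AUC additionally requires predicted probabilities, omitted here):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical gold labels and predictions for three of the ten emotion classes.
y_true = ["Anxious", "Sad", "Anxious", "Joking", "Sad", "Anxious"]
y_pred = ["Anxious", "Sad", "Sad", "Joking", "Sad", "Anxious"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
# Rows = true class, columns = predicted class.
print(confusion_matrix(y_true, y_pred, labels=["Anxious", "Joking", "Sad"]))
```

Macro F1 averages the per-class F1 scores equally (so rare emotions count as much as common ones), while micro F1 aggregates over all instances.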
Prerequisites: Python 3.8+ and pip
Installation:
- Clone the repository and navigate into it:
```bash
git clone <repository-url>
cd realworld-nlp-techniques
```
- Create and activate a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```
- Install required packages:
```bash
pip install -r requirements.txt
```
- Place the dataset CSV file at `Dataset/SenWave/SenWave_single_emotion_labeled.csv` (ensure it has columns `Tweet` and `Emotion`).
Run the entire pipeline (EDA, training both models, evaluation, and visualizations):
```bash
python main.py
```

For a faster run without BERT fine-tuning:

```bash
QUICK_TEST=1 python main.py
```

All results are saved to a timestamped folder, `output/<timestamp>/`, with subfolders:

- `plots/`: generated charts and figures
- `tables/`: metrics and predictions
- `models/`: saved model checkpoints
- `logs/`: execution logs
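Creating such a timestamped run folder is straightforward with the standard library; a small sketch (the subfolder names follow the layout described above, the helper name is an assumption):

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(base="output"):
    """Create output/<timestamp>/ with the four expected subfolders."""
    run_dir = Path(base) / datetime.now().strftime("%Y%m%d_%H%M%S")
    for sub in ("plots", "tables", "models", "logs"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir

run_dir = make_run_dir()
print(sorted(p.name for p in run_dir.iterdir()))
```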
| Model | Accuracy | Macro F1 | Training Time |
|---|---|---|---|
| Logistic Reg. | 55.2% | 0.44 | ~1 min |
| BERT | 68.3% | 0.48 | ~30 min |
- BERT performs better overall, especially on rare emotion classes.
- The baseline offers a good balance between speed and performance.
- Certain emotion categories remain challenging due to class imbalance.
Configuration: Adjust the `Config` class in `main.py` to modify training parameters, model settings, output paths, or early stopping criteria.
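The actual fields live in `main.py`; as a purely illustrative sketch of what such a config class might look like (all field names and defaults here are assumptions, not the project's real values):

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Illustrative fields only -- check main.py for the real ones.
    data_path: str = "Dataset/SenWave/SenWave_single_emotion_labeled.csv"
    output_dir: str = "output"
    test_size: float = 0.2
    batch_size: int = 16
    learning_rate: float = 2e-5
    max_epochs: int = 10
    early_stopping_patience: int = 2

cfg = Config(batch_size=8)  # override a single parameter, keep the other defaults
print(cfg.batch_size, cfg.test_size)
```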
Adding New Models:
- Create a new model class (implement training and prediction methods).
- Add the new model into the evaluation pipeline.
- Add any necessary evaluation metrics.
- Update comparison and visualization scripts accordingly.
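The steps above suggest a common train/predict interface; a trivial hedged example of a new model that could plug into such a pipeline (the method names are assumptions, not the repository's actual API):

```python
from collections import Counter

class MajorityClassModel:
    """Trivial example model: always predicts the most frequent training label."""

    def train(self, texts, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.majority_] * len(texts)

model = MajorityClassModel().train(["a", "b", "c"], ["Sad", "Anxious", "Sad"])
print(model.predict(["anything", "at all"]))
```

A dummy model like this also serves as a sanity-check floor: any real classifier should beat it on macro F1.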
- Missing dependencies: ensure all packages are installed (`pip install -r requirements.txt`).
- CUDA out of memory: reduce the batch size or run on CPU (set `CUDA_VISIBLE_DEVICES=""` to disable the GPU).
- Dataset not found: verify the CSV is placed at `Dataset/SenWave/SenWave_single_emotion_labeled.csv` (with the correct file name and permissions).
If you use the SenWave dataset, please cite the original paper:
@article{yang2020senwave,
title={SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic},
  author={Yang, Qiang and Alamro, Hind and Albaradei, Somayah and Salhi, Adil and Lv, Xiaoting and Ma, Changsheng and Alshehri, Manal and Jaber, Inji and Tifratene, Faroug and Wang, Wei and Gojobori, Takashi and Duarte, Carlos M. and Gao, Xin and Zhang, Xiangliang},
journal={arXiv preprint arXiv:2006.10842},
year={2020}
}

For more details on the dataset, see the SenWave repository.
Licensed under the MIT License. See the LICENSE file for details.