Project: Natural Language Processing (NLP) Pipeline for Disaster Tweet Classification
CrisisView is an end-to-end NLP pipeline designed for Emergency Response stakeholders (e.g., Red Cross, FEMA). In the wake of a disaster, social media is often flooded with noise—metaphors ("This party is on fire"), movie reviews, and spam. This tool filters out that noise to identify real-time, actionable disaster alerts.
Objective: Classify tweets as Real Disaster (1) or Not a Real Disaster (0) with high precision.
- Source: Kaggle - Natural Language Processing with Disaster Tweets
- Size: ~7,600 Tweets
- Classes: Binary (Real Disaster vs. Fake/Metaphorical)
We implemented a robust cleaning pipeline based on Exploratory Data Analysis (EDA):
- Noise Removal: Targeted removal of HTML artifacts (`&amp;`), news-sharing terms ("via"), and platform-specific noise.
- Normalization: Lowercasing and Lemmatization (WordNet) to reduce sparsity.
- Privacy: Automated stripping of URLs and user mentions (`@user`) to prevent overfitting to specific handles.
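The cleaning steps above can be sketched with the standard library alone. This is an illustrative version, not the project's exact regexes, and it omits the WordNet lemmatization step (which requires NLTK):

```python
# Minimal sketch of the cleaning pipeline: HTML decoding, URL and
# mention stripping, "via" removal, lowercasing, whitespace collapse.
# The real pipeline also applies WordNet lemmatization via NLTK.
import html
import re

def clean_tweet(text: str) -> str:
    text = html.unescape(text)                       # decode &amp; etc.
    text = re.sub(r"https?://\S+", "", text)         # strip URLs (privacy)
    text = re.sub(r"@\w+", "", text)                 # strip user mentions
    text = re.sub(r"\bvia\b", "", text, flags=re.I)  # news-sharing term
    text = text.lower()                              # normalize case
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(clean_tweet("Forest fire near La Ronge &amp; area via @CBCNews http://t.co/x"))
# -> forest fire near la ronge & area
```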
We compared three distinct vectorization strategies:
- TF-IDF (Sparse): Captures explicit keyword signals (e.g., "Hiroshima", "flood").
- Word2Vec (Dense - Custom): Trained from scratch on the dataset (demonstrates limitations of small-data embeddings).
- GloVe (Dense - Pre-trained): Utilized pre-trained GloVe Twitter embeddings (`glove.twitter.27B`, 100d) to leverage Transfer Learning from roughly two billion tweets.
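To see why the sparse TF-IDF features carry strong keyword signal, here is a hand-rolled sketch of the basic weighting formula (standard library only, toy corpus): rare disaster terms get high weights, ubiquitous words get zero. The project's sklearn `TfidfVectorizer` adds smoothing and normalization on top of this:

```python
# Basic TF-IDF: tf(w, d) * log(N / df(w)) on a toy three-tweet corpus.
import math
from collections import Counter

docs = [
    "flood warning issued for the city",
    "the party is on fire",
    "fire crews battle the flood downtown",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# Document frequency: number of docs each word appears in.
df = Counter(w for toks in tokenized for w in set(toks))

def tfidf(word: str, toks: list[str]) -> float:
    tf = toks.count(word) / len(toks)
    idf = math.log(N / df[word])
    return tf * idf

# "warning" (1 doc) outweighs "flood" (2 docs); "the" (all docs) scores 0.
for w in ["warning", "flood", "the"]:
    print(w, round(tfidf(w, tokenized[0]), 3))
```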
We moved beyond baseline defaults by implementing rigorous experimental controls:
- Baseline: Multinomial Naive Bayes.
- Classical ML (Tuned): Logistic Regression optimized via GridSearchCV (5-Fold Cross-Validation) to tune Regularization (`C`) and Solvers.
- Deep Learning: A custom Neural Network (Keras/TensorFlow) architecture featuring:
- Frozen GloVe Embedding Layer
- GlobalAveragePooling1D
- Dense Layers with ReLU activation & Dropout for regularization.
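The frozen-embedding-plus-pooling front end of the network can be shown in plain NumPy: each tweet becomes the mean of its (non-trainable) word vectors, which is exactly what `GlobalAveragePooling1D` computes over an embedding lookup. The toy 4-d matrix here stands in for the 100-d GloVe vectors:

```python
# Frozen embedding lookup + global average pooling, in NumPy.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4
embeddings = rng.normal(size=(vocab_size, embed_dim))  # frozen lookup table

def encode(token_ids: list[int]) -> np.ndarray:
    # Lookup -> (seq_len, embed_dim), then mean over the sequence axis,
    # matching GlobalAveragePooling1D on a frozen Embedding layer.
    return embeddings[token_ids].mean(axis=0)

vec = encode([1, 3, 5])
print(vec.shape)  # -> (4,)
```

The pooled vector then feeds the trainable Dense/ReLU/Dropout stack.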
The project compared Generative vs. Discriminative models and Sparse vs. Dense features.
| Model | Feature Set | Optimization | Performance Notes |
|---|---|---|---|
| Logistic Regression | TF-IDF | GridSearchCV | Top Performer. Excellent balance of Precision/Recall. |
| Deep Learning | GloVe (Transfer Learning) | Adam Optimizer | Competitive. Captures semantic meaning but computationally heavier. |
| Logistic Regression | Word2Vec (Custom) | Default | Underperformed due to small training corpus size. |
| Naive Bayes | TF-IDF | Default | Strong baseline but struggles with context/sarcasm. |
Key Insight: While Deep Learning is powerful, TF-IDF with Tuned Logistic Regression proved highly effective for this specific dataset size, highlighting that complex models are not always better for short-text classification.
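The winning setup can be sketched as a scikit-learn pipeline, assuming sklearn is installed. The corpus and labels here are toy examples, and 2-fold CV keeps it runnable; the project used the full dataset with 5-fold CV:

```python
# TF-IDF + Logistic Regression tuned with GridSearchCV (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "flood warning issued downtown", "earthquake hits the coast",
    "wildfire spreads near homes", "storm causes major damage",
    "this party is on fire", "that movie was a disaster",
    "my mixtape is straight fire", "what a bomb outfit",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = real disaster, 0 = metaphor

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe,
                    {"clf__C": [0.1, 1.0, 10.0],
                     "clf__solver": ["liblinear", "lbfgs"]},
                    cv=2)  # project used cv=5 on the full dataset
grid.fit(texts, labels)
print(grid.best_params_)
```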
- ✅ Text Cleaning: Regex + HTML decoding + Custom Stopwords.
- ✅ Visualization: WordClouds, N-Gram Bar Charts, and Embedding PCA Clusters.
- ✅ Transfer Learning: Integration of pre-trained GloVe vectors.
- ✅ Hyperparameter Tuning: `GridSearchCV` for optimal model configuration.
- ✅ Generative AI: Simple Markov Chain text generator (Bonus Task).
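The bonus Markov Chain generator works like this toy bigram version (standard library only, illustrative seed corpus): record each word's observed successors, then sample a chain:

```python
# Bigram Markov chain: map each word to its observed successors,
# then walk the chain by sampling the next word at random.
import random
from collections import defaultdict

corpus = "flood warning issued . fire crews respond . flood waters rise".split()
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start: str, length: int = 5, seed: int = 42) -> list[str]:
    random.seed(seed)  # seeded for reproducibility
    out = [start]
    for _ in range(length - 1):
        options = transitions.get(out[-1])
        if not options:  # dead end: word has no observed successor
            break
        out.append(random.choice(options))
    return out

print(" ".join(generate("flood")))
```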
Ensure you have the dependencies installed:

```bash
pip install -r ./requirements.txt
```