Therm1te/CrisisView-AI-Powered-Emergency-Signal-Filtering
CrisisView: AI-Powered Emergency Signal Filtering

Project: Natural Language Processing (NLP) Pipeline for Disaster Tweet Classification


📌 Project Overview

CrisisView is an end-to-end NLP pipeline designed for Emergency Response stakeholders (e.g., Red Cross, FEMA). In the wake of a disaster, social media is often flooded with noise—metaphors ("This party is on fire"), movie reviews, and spam. This tool filters out that noise to identify real-time, actionable disaster alerts.

Objective: Classify tweets as Real Disaster (1) or Not a Real Disaster (0) with high precision.


📂 Dataset


⚙️ Technical Approach

1. Advanced Preprocessing

We implemented a robust cleaning pipeline based on Exploratory Data Analysis (EDA):

  • Noise Removal: Targeted removal of HTML artifacts (e.g., &amp;), news-sharing terms ("via"), and platform-specific noise.
  • Normalization: Lowercasing and Lemmatization (WordNet) to reduce sparsity.
  • Privacy: Automated stripping of URLs and User Mentions (@user) to prevent overfitting to specific handles.
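A minimal sketch of such a cleaning step, using only the standard library (the exact regexes are illustrative assumptions, not the project's actual code; the pipeline additionally applies WordNet lemmatization via NLTK, omitted here):

```python
import html
import re

def clean_tweet(text: str) -> str:
    """Normalize a raw tweet: decode HTML, strip URLs/mentions/noise, lowercase."""
    text = html.unescape(text)                         # decode artifacts like &amp;
    text = re.sub(r"https?://\S+", " ", text)          # strip URLs (privacy / overfitting)
    text = re.sub(r"@\w+", " ", text)                  # strip user mentions
    text = re.sub(r"\bvia\b", " ", text, flags=re.I)   # drop news-sharing term
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()   # keep letters only, lowercase
    return " ".join(text.split())                      # collapse whitespace

print(clean_tweet("Fire near me via @user http://t.co/x &amp; smoke"))
# lemmatization (e.g., nltk.stem.WordNetLemmatizer) would follow this step
```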

2. Feature Engineering (Sparse vs. Dense)

We compared three distinct vectorization strategies:

  1. TF-IDF (Sparse): Captures explicit keyword signals (e.g., "Hiroshima", "flood").
  2. Word2Vec (Dense - Custom): Trained from scratch on the dataset (demonstrates limitations of small-data embeddings).
  3. GloVe (Dense - Pre-trained): Utilized Twitter-27B GloVe embeddings (100d) to leverage Transfer Learning from billions of tweets.
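The sparse-vs-dense contrast above can be sketched as follows. The tiny 3-d embedding dict is a hypothetical stand-in for the 100-d GloVe Twitter vectors; document vectors are built by averaging word vectors, one common approach for this setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["flood warning issued", "this party is on fire"]

# Sparse: TF-IDF keyword weights, one column per vocabulary term
tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs)   # shape: (n_docs, vocab_size)

# Dense: average pre-trained word vectors (toy 3-d stand-ins for GloVe 100-d)
toy_glove = {
    "flood": np.array([1.0, 0.0, 0.0]),
    "fire":  np.array([0.0, 1.0, 0.0]),
}

def doc_vector(doc: str, emb: dict, dim: int = 3) -> np.ndarray:
    """Mean of the embedding vectors for in-vocabulary words."""
    vecs = [emb[w] for w in doc.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_dense = np.vstack([doc_vector(d, toy_glove) for d in docs])  # shape: (2, 3)
```

The sparse matrix grows with vocabulary size; the dense matrix has a fixed width equal to the embedding dimension, which is what lets pre-trained vectors transfer across datasets.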

3. Modelling & Optimization

We moved beyond baseline defaults by implementing rigorous experimental controls:

  • Baseline: Multinomial Naive Bayes.
  • Classical ML (Tuned): Logistic Regression optimized via GridSearchCV (5-Fold Cross-Validation) to tune Regularization (C) and Solvers.
  • Deep Learning: A custom Neural Network (Keras/TensorFlow) architecture featuring:
    • Frozen GloVe Embedding Layer
    • GlobalAveragePooling1D
    • Dense Layers with ReLU activation & Dropout for regularization.
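The GridSearchCV tuning described above can be sketched like this; the synthetic data stands in for the TF-IDF features, and the exact parameter grid is an assumption:

```python
from sklearn.datasets import make_classification  # stand-in for TF-IDF features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Tune regularization strength C and solver via 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_)
```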

📊 Results Summary

The project compared Generative vs. Discriminative models and Sparse vs. Dense features.

| Model | Feature Set | Optimization | Performance Notes |
|---|---|---|---|
| Logistic Regression | TF-IDF | GridSearchCV | Top performer; excellent balance of precision/recall. |
| Deep Learning | GloVe (Transfer Learning) | Adam optimizer | Competitive; captures semantic meaning but computationally heavier. |
| Logistic Regression | Word2Vec (custom) | Default | Underperformed due to small training corpus size. |
| Naive Bayes | TF-IDF | Default | Strong baseline but struggles with context/sarcasm. |

Key Insight: While Deep Learning is powerful, TF-IDF with Tuned Logistic Regression proved highly effective for this specific dataset size, highlighting that complex models are not always better for short-text classification.


🚀 Key Features Implemented

  • Text Cleaning: Regex + HTML decoding + Custom Stopwords.
  • Visualization: WordClouds, N-Gram Bar Charts, and Embedding PCA Clusters.
  • Transfer Learning: Integration of pre-trained GloVe vectors.
  • Hyperparameter Tuning: GridSearchCV for optimal model configuration.
  • Generative AI: Simple Markov Chain text generator (Bonus Task).
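The Markov chain bonus task can be sketched as a first-order model: each word maps to the list of words observed after it, and generation walks that map. The training sentence and seed below are illustrative, not from the project:

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain: dict, seed: str, length: int = 8, rng=None) -> str:
    """Random walk over the chain, starting from seed."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = [seed]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

chain = build_chain("fire spreads fast and fire alarms ring loudly")
print(generate(chain, "fire"))
```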

🛠️ Setup & Usage

Prerequisites

Install the required dependencies:

pip install -r ./requirements.txt
