Project Description
Author: @aashishvinod
Title: DATA605_Spring2026_Real_Time_Tweet_Sentiment_Analysis_Kafka
Course: DATA605 Spring 2026
Link: https://github.com/gpsaggese/gpsaggese.github.io/blob/master/class_project/data605/Spring2026/projects_descriptions/Apache_Kafka_Project_Description.md
Summary: This project builds a real-time tweet sentiment analysis pipeline using Apache Kafka for stream ingestion and a pre-trained HuggingFace RoBERTa model (cardiffnlp/twitter-roberta-base-sentiment) for sentiment classification. Tweets from the Sentiment140 dataset (1.6M labeled tweets) are streamed through Kafka, classified in real time, and visualized using a live Streamlit dashboard with sentiment trends, comparative model analysis, and anomaly detection.
Planned Workflow:
- Load and preprocess the Sentiment140 dataset (1.6M tweets)
- Kafka producer setup (publish tweet events to a Kafka topic)
- HuggingFace RoBERTa model loading and classification
- Kafka consumer setup (consume and classify tweets in real time)
- Sentiment aggregation (positive, negative, neutral counts and accuracy)
- Spark SQL analysis (sentiment distribution, high confidence predictions)
- Comparative model analysis (RoBERTa vs DistilBERT accuracy comparison)
- Anomaly detection (detect sudden spikes in sentiment using rolling statistics)
- Live Streamlit dashboard (real-time pie chart, trend chart, metrics)
- Performance analysis (Kafka producer throughput vs batch size)
Tools: Apache Kafka, HuggingFace Transformers (RoBERTa), Apache Spark (PySpark), Streamlit, Docker, Python, Sentiment140 Dataset
Assigned to: @aashishvinod @gpsaggese @protocorn
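
The anomaly detection step in the planned workflow (sudden sentiment spikes found with rolling statistics) could be sketched roughly as below. This is only an illustration under stated assumptions, not the project's implementation: the function name `detect_spikes`, the window size, the z-score threshold, and the sample `negative_share` values are all hypothetical, and a trailing-window z-score is just one of several rolling statistics that would fit the description.

```python
from collections import deque
from statistics import mean, pstdev


def detect_spikes(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` population standard deviations.

    Hypothetical helper: name and defaults are assumptions for illustration.
    """
    history = deque(maxlen=window)  # holds the most recent `window` values
    spikes = []
    for i, value in enumerate(values):
        # Only score a point once a full window of earlier values exists.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                spikes.append(i)
        history.append(value)
    return spikes


# Made-up share of negative tweets per time bucket (not real results).
negative_share = [0.30, 0.32, 0.29, 0.31, 0.30, 0.31, 0.75, 0.30]
spikes = detect_spikes(negative_share, window=5, threshold=3.0)
print(spikes)
```

In a streaming setting the same logic would run on per-bucket aggregates produced by the consumer. Note that a flagged value still enters the window afterwards, which inflates the standard deviation for the next few buckets; excluding flagged points from the history is a possible variation.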