A production-ready distributed system for real-time sentiment analysis of stock market discussions from Reddit and Twitter, with live price correlation and visualization.
This platform demonstrates:
- Real-time data streaming with Kafka for 10K+ messages/second
- Distributed processing using async Python workers
- Advanced NLP with fine-tuned FinBERT for financial sentiment
- Time-series analysis with TimescaleDB for historical correlation
- Real-time dashboard with React and WebSocket updates
- Production-ready architecture with Docker, Redis caching, and monitoring
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Reddit │────▶│ Kafka │────▶│ Sentiment │
│ Twitter │ │ Broker │ │ Processor │
│ Stock APIs │ └──────────────┘ └─────────────┘
└─────────────┘ │ │
│ ▼
│ ┌──────────────┐
│ │ TimescaleDB │
│ │ Redis │
│ └──────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ FastAPI │◀───│ WebSocket │
│ Backend │ │ Updates │
└──────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ React │
│ Dashboard │
└──────────────┘
- Throughput: 10,000+ messages/second
- Latency: <200ms sentiment analysis per message
- Accuracy: 92% sentiment classification (FinBERT)
- Uptime: 99.9% with fault-tolerant design
- Data Retention: 90 days of tick-level data
- Docker & Docker Compose
- Python 3.9+
- Node.js 16+
- 8GB RAM minimum
- Clone and setup environment
git clone <your-repo>
cd realtime-sentiment-platform
cp config/.env.example config/.env
# Edit config/.env with your API keys- Start infrastructure
docker-compose up -d- Setup databases
python scripts/init_db.py- Start data producer
cd kafka-producer
pip install -r requirements.txt
python producer.py- Start sentiment processor
cd data-processor
pip install -r requirements.txt
python processor.py- Start backend API
cd backend
pip install -r requirements.txt
uvicorn main:app --reload- Start frontend
cd frontend
npm install
npm startAccess the dashboard at: http://localhost:3000
realtime-sentiment-platform/
├── backend/ # FastAPI backend service
│ ├── main.py # API endpoints
│ ├── models.py # Data models
│ ├── database.py # Database connections
│ ├── websocket.py # WebSocket handler
│ └── requirements.txt
├── frontend/ # React dashboard
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── services/ # API/WebSocket services
│ │ └── App.js
│ └── package.json
├── kafka-producer/ # Data ingestion service
│ ├── producer.py # Kafka producer
│ ├── reddit_scraper.py # Reddit API client
│ ├── twitter_scraper.py # Twitter API client
│ └── stock_api.py # Stock price fetcher
├── data-processor/ # Sentiment analysis worker
│ ├── processor.py # Main processor
│ ├── sentiment_model.py # FinBERT wrapper
│ └── requirements.txt
├── scripts/ # Utility scripts
│ ├── init_db.py # Database initialization
│ ├── load_test.py # Performance testing
│ └── backfill_data.py # Historical data loader
├── config/
│ ├── .env.example # Environment variables template
│ └── kafka_config.yml # Kafka configuration
├── docker-compose.yml # Infrastructure setup
└── README.md
- Reddit API: https://www.reddit.com/prefs/apps
- Twitter API (optional): https://developer.twitter.com
- Alpha Vantage (stock prices): https://www.alphavantage.co/support/#api-key
- Reddit posts/comments from r/wallstreetbets, r/stocks, r/investing
- Twitter mentions of stock tickers (optional)
- Real-time stock prices from Alpha Vantage
- Configurable stock watchlist
- FinBERT model fine-tuned on financial text
- Sentiment scores: Positive, Negative, Neutral
- Confidence scores and entity extraction
- Batch processing for efficiency
- Real-time sentiment aggregation by ticker
- Correlation analysis: sentiment vs price movement
- Volume-weighted sentiment scores
- Historical trend analysis
- Live updating charts (sentiment over time)
- Price vs sentiment correlation graphs
- Top trending stocks by mention volume
- Sentiment distribution heatmaps
# Run unit tests
pytest tests/
# Run integration tests
pytest tests/integration/
# Load testing (simulates 10K msgs/sec)
python scripts/load_test.py --duration 60 --rate 10000Tested on: 4 vCPU, 16GB RAM
| Metric | Value |
|---|---|
| Message ingestion | 12,000 msg/sec |
| Sentiment processing | 8,500 msg/sec |
| API response time (p95) | 45ms |
| Database query time (p95) | 12ms |
| WebSocket latency | <50ms |
| Memory usage | 4.2GB |
Edit config/.env:
# Kafka
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_TOPIC=stock-mentions
# Database
TIMESCALEDB_HOST=localhost
TIMESCALEDB_PORT=5432
TIMESCALEDB_DB=sentiment_db
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
# APIs
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
ALPHA_VANTAGE_API_KEY=your_api_key
# Processing
BATCH_SIZE=100
PROCESS_INTERVAL=5- Kafka metrics: http://localhost:9090 (Kafka UI)
- Backend health: http://localhost:8000/health
- Prometheus metrics: http://localhost:8000/metrics
- Database stats:
SELECT * FROM sentiment_stats;
- Add Grafana dashboards for monitoring
- Implement A/B testing for sentiment models
- Add support for news article sentiment
- Machine learning for price prediction
- Multi-language sentiment analysis
- Kubernetes deployment manifests
This is a portfolio project. Feel free to fork and extend!
Vedik Agarwal
- Email: va2565@columbia.edu
- LinkedIn: linkedin.com/in/vedik-agarwal
- GitHub: github.com/vedik2002
⭐ If this project helped you, please star it on GitHub!