A distributed TTB COLA scraper using a producer-consumer architecture with 4 download VMs and 1 processing VM.
- Producers (VM1-VM4): Download CSV files, detail pages, and certificate pages from TTB
- Consumer (VM5): Processes stored files and populates the database
- Storage: Organized file storage with VM-specific folders
- Coordination: VM status tracking and progress monitoring
✅ No S3 bucket required for local testing: files are stored on the local filesystem.
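The producer-consumer flow above can be sketched with stdlib primitives (in the real system Celery and Redis play the broker role; here a `queue.Queue` stands in, and the file paths and helper names are illustrative, not the repository's actual API):

```python
# Stdlib sketch of the producer-consumer split (the real system uses
# Celery + Redis; a queue.Queue stands in for the message broker).
import queue
import threading

def producer(vm_id: str, years: list[int], q: queue.Queue) -> None:
    """Download VM: 'downloads' one file per year and enqueues its path."""
    for year in years:
        path = f"storage/downloads/{vm_id}/csv/{year}.csv"
        q.put(path)  # hand the stored file off to the processor

def consumer(q: queue.Queue, processed: list[str]) -> None:
    """Processing VM: drains the queue and stands in for DB population."""
    while True:
        path = q.get()
        if path is None:  # sentinel: no more work
            break
        processed.append(path)  # stand-in for parsing + DB insert

q: queue.Queue = queue.Queue()
processed: list[str] = []
t = threading.Thread(target=consumer, args=(q, processed))
t.start()
producer("vm1", [2024, 2020], q)
q.put(None)
t.join()
print(processed)
# → ['storage/downloads/vm1/csv/2024.csv', 'storage/downloads/vm1/csv/2020.csv']
```

Because downloads and processing only communicate through the broker and stored files, either side can crash and restart without losing the other's progress.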
Quick Start:
```bash
# One-command setup and test
./test-local.sh
```
Manual Setup:
```bash
# Start all services
docker compose up -d

# Wait for initialization
sleep 30

# Run tests
python scripts/test_local_docker.py
```
Monitor System:
```bash
# Flower dashboard
open http://localhost:5555

# View logs
docker compose logs -f celery-downloader-1
```
See DOCKER_TESTING.md for comprehensive Docker testing guide.
Setup Environment:
```bash
cp .env.example .env
# Edit .env with your database URL
```
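A minimal `.env` might look like the sketch below; only variables mentioned elsewhere in this README are used, and the example values are placeholders:

```shell
# Minimal .env sketch (placeholder values)
DATABASE_URL=postgresql://user:pass@localhost:5432/ttb
STORAGE_BACKEND=local

# Only needed for Render deployment:
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=your-secret-key
```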
Run Downloader VM:
```bash
python scripts/run_downloader.py --vm-id vm1
```
Run Processor VM:
```bash
python scripts/run_processor.py --vm-id vm5
```
Prerequisites:
- AWS S3 bucket and credentials
- Render account with GitHub connected
Deploy:
```bash
./deploy-render.sh
```
Configure S3 Credentials:
- Add `S3_ACCESS_KEY` and `S3_SECRET_KEY` to each service in the Render dashboard
Monitor:
```bash
python scripts/monitor_render_deployment.py
```
See RENDER_DEPLOYMENT.md for detailed deployment instructions.
├── src/ # Source code
│ ├── coordination/ # VM coordination and status tracking
│ ├── storage/ # File storage management (local/S3/Render)
│ ├── downloader/ # TTB file downloading
│ ├── processor/ # File processing and database population
│ ├── scraper/ # TTB parsers and data processing
│ ├── db/ # Database models and connection
│ ├── ttb/ # TTB session management
│ └── utils/ # Logging and utilities
├── scripts/ # Execution scripts
├── reference/v1-archive/ # Archived v1 monolithic scraper
├── docker-compose.yml # Local Docker testing setup
├── DOCKER_TESTING.md # Docker testing guide
└── NEW_ARCHITECTURE.md # Detailed architecture documentation
- celery-downloader-1: Downloads years 2024, 2020, 2016, 2012, 2008, 2004, 2000, 1996
- celery-downloader-2: Downloads years 2023, 2019, 2015, 2011, 2007, 2003, 1999, 1995
- celery-downloader-3: Downloads years 2022, 2018, 2014, 2010, 2006, 2002, 1998
- celery-downloader-4: Downloads years 2021, 2017, 2013, 2009, 2005, 2001, 1997
- celery-processor: Processes all downloaded files into database
- PostgreSQL: Shared database for coordination and data storage
- Redis: Message broker for Celery tasks
- Flower: Monitoring dashboard for Celery tasks
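The year assignment above is a round-robin deal of 1995-2024 across the four downloaders, in descending order. A sketch of how that sharding could be computed (the function name is illustrative):

```python
# Sketch of the round-robin year sharding shown above: years are dealt
# out to the four downloader services in descending order from 2024.
def shard_years(start: int = 2024, end: int = 1995, workers: int = 4) -> dict[str, list[int]]:
    shards: dict[str, list[int]] = {
        f"celery-downloader-{i + 1}": [] for i in range(workers)
    }
    for offset, year in enumerate(range(start, end - 1, -1)):
        shards[f"celery-downloader-{offset % workers + 1}"].append(year)
    return shards

shards = shard_years()
print(shards["celery-downloader-1"])
# → [2024, 2020, 2016, 2012, 2008, 2004, 2000, 1996]
```

With 30 years over 4 workers, downloaders 1 and 2 each get 8 years and downloaders 3 and 4 each get 7, matching the lists above.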
```bash
# Check all services status
python scripts/monitor_render_deployment.py

# Monitor via Flower dashboard
open https://your-flower-service.onrender.com

# View logs in Render dashboard
open https://dashboard.render.com

# Access database via Render dashboard
# Use Render's built-in database connection tools
```
Cost Estimates:
- Monthly: ~$150-200 for 5 services + database
- One-time data collection: ~$300-500 for complete historical dataset
- Ongoing: Daily incremental updates ~$50/month
- DOCKER_TESTING.md - Complete Docker testing guide
- NEW_ARCHITECTURE.md - Complete architecture overview
- reference/v1-archive/docs/ - Original documentation
- Fault Tolerance: VMs restart independently without losing progress
- Scalability: Downloads and processing are separated
- Storage: All files preserved for reprocessing
- Monitoring: Real-time VM status and progress tracking
- Render Optimized: Built specifically for Render deployment
- Docker Testing: Production-like local testing environment
storage/
├── downloads/vm1/csv/ # VM1 CSV files
├── downloads/vm1/details/ # VM1 detail pages
├── downloads/vm1/certificates/ # VM1 certificate pages
├── downloads/vm2-4/ # Other VMs (similar structure)
├── processing/status/ # Processing status tracking
└── coordination/ # VM coordination data
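A path under the layout above could be built with a small helper (a sketch; the function name is illustrative, not the repository's actual API):

```python
# Sketch of building a download path for the storage layout above;
# the helper name is illustrative, not the repository's actual API.
from pathlib import Path

def download_path(vm_id: str, kind: str, filename: str) -> Path:
    """kind is one of 'csv', 'details', or 'certificates'."""
    assert kind in {"csv", "details", "certificates"}, f"unknown kind: {kind}"
    return Path("storage") / "downloads" / vm_id / kind / filename

print(download_path("vm1", "csv", "2024.csv"))
# → storage/downloads/vm1/csv/2024.csv (POSIX separators)
```

Keeping every VM's files in its own subtree means no two downloaders ever write to the same path, so no cross-VM locking is needed.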
```bash
# Test locally with Docker
docker compose up -d
python scripts/test_local_docker.py

# Test individual components
python scripts/run_downloader.py --vm-id vm1 --test-mode
python scripts/run_processor.py --vm-id vm5 --test-mode

# Deploy to Render
./deploy-render.sh

# Monitor production
python scripts/monitor_render_deployment.py
```

- Use `docker-compose.yml`
  - Local PostgreSQL in container
  - Shared volumes for file storage
  - Easy debugging and testing
- Use `STORAGE_BACKEND=local`
  - Single VM testing
  - PostgreSQL in Docker container
- Use `STORAGE_BACKEND=render_disk`
  - Full 5-service distributed processing with Celery
  - Managed PostgreSQL and Redis databases
  - Auto-scaling and fault tolerance
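Selecting a backend from `STORAGE_BACKEND` could look like the sketch below (the class names and the `/var/data` mount point are assumptions for illustration, not the repository's actual API):

```python
# Hedged sketch of dispatching on STORAGE_BACKEND; class names and the
# Render disk mount point are illustrative assumptions.
import os

class LocalStorage:
    root = "storage"  # local filesystem tree shown earlier

class RenderDiskStorage:
    root = "/var/data"  # assumed Render persistent-disk mount

BACKENDS = {"local": LocalStorage, "render_disk": RenderDiskStorage}

def get_storage():
    name = os.environ.get("STORAGE_BACKEND", "local")
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unknown STORAGE_BACKEND: {name!r}")
```

Defaulting to `local` keeps Docker testing zero-config, while the Render services only need the one environment variable changed.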
This producer-consumer architecture ensures reliable data collection with the ability to resume from any failure point.