TTB COLA Producer-Consumer Scraper

A distributed scraper for TTB COLA (Alcohol and Tobacco Tax and Trade Bureau Certificate of Label Approval) records, using a producer-consumer architecture with four download VMs and one processing VM.

Architecture

  • Producers (VM1-VM4): Download CSV files, detail pages, and certificate pages from TTB
  • Consumer (VM5): Processes stored files and populates the database
  • Storage: Organized file storage with VM-specific folders
  • Coordination: VM status tracking and progress monitoring
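The producer-consumer split above can be sketched with Python's standard threading and queue primitives. This is an illustrative model only: function names like `downloader` and `processor` are hypothetical, and in the real system the hand-off happens through stored files and Celery tasks rather than an in-process queue.

```python
from queue import Queue
from threading import Thread

def downloader(vm_id, pages, out_queue):
    """Producer: pretend to download each page and hand it off.

    In the real system each artifact is written to VM-specific storage
    instead of an in-memory queue.
    """
    for page in pages:
        out_queue.put((vm_id, page))

def processor(in_queue, results, n_items):
    """Consumer: drain stored items and 'populate the database'."""
    for _ in range(n_items):
        vm_id, page = in_queue.get()
        results.append(f"processed {page} from {vm_id}")
        in_queue.task_done()

q = Queue()
results = []
# Four producers (vm1-vm4) and one consumer (vm5), mirroring the layout above.
producers = [
    Thread(target=downloader, args=(f"vm{i}", [f"page-{i}-{j}" for j in range(2)], q))
    for i in range(1, 5)
]
consumer = Thread(target=processor, args=(q, results, 8))
for t in producers:
    t.start()
consumer.start()
for t in producers:
    t.join()
consumer.join()
print(len(results))  # 8
```

Because downloads and processing only communicate through the queue (storage, in production), either side can restart without the other losing state.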

Quick Start

Docker Testing (Recommended for Local Development)

✅ No S3 bucket required! Uses local filesystem storage.

  1. Quick Start:

    # One-command setup and test
    ./test-local.sh
  2. Manual Setup:

    # Start all services
    docker compose up -d
    
    # Wait for initialization
    sleep 30
    
    # Run tests
    python scripts/test_local_docker.py
  3. Monitor System:

    # Flower dashboard
    open http://localhost:5555
    
    # View logs
    docker compose logs -f celery-downloader-1

See DOCKER_TESTING.md for a comprehensive Docker testing guide.

Local Development

  1. Setup Environment:

    cp .env.example .env
    # Edit .env with your database URL
  2. Run Downloader VM:

    python scripts/run_downloader.py --vm-id vm1
  3. Run Processor VM:

    python scripts/run_processor.py --vm-id vm5

Render Deployment (Production)

⚠️ IMPORTANT: Render deployment requires S3 storage - see RENDER_DEPLOYMENT.md for complete setup guide.

  1. Prerequisites:

    • AWS S3 bucket and credentials
    • Render account with GitHub connected
  2. Deploy:

    ./deploy-render.sh
  3. Configure S3 Credentials:

    • Add S3_ACCESS_KEY and S3_SECRET_KEY to each service in Render dashboard
  4. Monitor:

    python scripts/monitor_render_deployment.py

See RENDER_DEPLOYMENT.md for detailed deployment instructions.

Project Structure

├── src/                    # Source code
│   ├── coordination/       # VM coordination and status tracking
│   ├── storage/           # File storage management (local/S3/Render)
│   ├── downloader/        # TTB file downloading
│   ├── processor/         # File processing and database population
│   ├── scraper/           # TTB parsers and data processing
│   ├── db/                # Database models and connection
│   ├── ttb/               # TTB session management
│   └── utils/             # Logging and utilities
├── scripts/               # Execution scripts
├── reference/v1-archive/  # Archived v1 monolithic scraper
├── docker-compose.yml     # Local Docker testing setup
├── DOCKER_TESTING.md      # Docker testing guide
└── NEW_ARCHITECTURE.md    # Detailed architecture documentation

Render Deployment Details

Services Deployed

  • celery-downloader-1: Downloads years 2024, 2020, 2016, 2012, 2008, 2004, 2000, 1996
  • celery-downloader-2: Downloads years 2023, 2019, 2015, 2011, 2007, 2003, 1999, 1995
  • celery-downloader-3: Downloads years 2022, 2018, 2014, 2010, 2006, 2002, 1998
  • celery-downloader-4: Downloads years 2021, 2017, 2013, 2009, 2005, 2001, 1997
  • celery-processor: Processes all downloaded files into the database
  • PostgreSQL: Shared database for coordination and data storage
  • Redis: Message broker for Celery tasks
  • Flower: Monitoring dashboard for Celery tasks
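The year assignments listed above follow a simple round-robin: newest year first, dealt across the four downloaders. A small helper (hypothetical, not the repo's actual scheduler) reproduces the split:

```python
def assign_years(start=2024, end=1995, n_workers=4):
    """Deal years round-robin across downloader workers, newest first.

    Returns {worker_index: [years]}. Newest years go to worker 1 first so
    recent data is collected early. Illustrative helper only.
    """
    assignment = {i: [] for i in range(1, n_workers + 1)}
    for offset, year in enumerate(range(start, end - 1, -1)):
        assignment[offset % n_workers + 1].append(year)
    return assignment

years = assign_years()
print(years[1])  # [2024, 2020, 2016, 2012, 2008, 2004, 2000, 1996]
print(years[3])  # [2022, 2018, 2014, 2010, 2006, 2002, 1998]
```

This matches the service lists above: workers 1 and 2 carry eight years each, workers 3 and 4 carry seven, covering 1995-2024.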

Monitoring Commands

# Check all services status
python scripts/monitor_render_deployment.py

# Monitor via Flower dashboard
open https://your-flower-service.onrender.com

# View logs in Render dashboard
open https://dashboard.render.com

# Access database via Render dashboard
# Use Render's built-in database connection tools

Cost Estimation

  • Monthly: ~$150-200 for 5 services + database
  • One-time data collection: ~$300-500 for complete historical dataset
  • Ongoing: Daily incremental updates ~$50/month

Documentation

  • DOCKER_TESTING.md – Docker testing guide
  • RENDER_DEPLOYMENT.md – Render deployment instructions
  • NEW_ARCHITECTURE.md – detailed architecture documentation

Key Benefits

  • Fault Tolerance: VMs restart independently without losing progress
  • Scalability: Downloads and processing are separated
  • Storage: All files preserved for reprocessing
  • Monitoring: Real-time VM status and progress tracking
  • Render Optimized: Built specifically for Render deployment
  • Docker Testing: Production-like local testing environment

Storage Structure

storage/
├── downloads/vm1/csv/          # VM1 CSV files
├── downloads/vm1/details/      # VM1 detail pages
├── downloads/vm1/certificates/ # VM1 certificate pages
├── downloads/vm2-4/            # Other VMs (similar structure)
├── processing/status/          # Processing status tracking
└── coordination/               # VM coordination data
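A path helper for the layout above might look like the following sketch. The function name and validation are illustrative assumptions; the repo's actual storage module (src/storage/) may structure this differently.

```python
from pathlib import Path

def download_path(root, vm_id, file_type, filename):
    """Build a VM-scoped download path mirroring the storage tree above.

    file_type must be one of 'csv', 'details', or 'certificates'.
    Hypothetical helper for illustration.
    """
    allowed = {"csv", "details", "certificates"}
    if file_type not in allowed:
        raise ValueError(f"unknown file type: {file_type!r}")
    return Path(root) / "downloads" / vm_id / file_type / filename

p = download_path("storage", "vm1", "csv", "2024.csv")
print(p.as_posix())  # storage/downloads/vm1/csv/2024.csv
```

Keeping every path VM-scoped is what lets the downloaders write concurrently without coordinating on filenames.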

Development Workflow

1. Local Testing (Docker)

# Test locally with Docker
docker compose up -d
python scripts/test_local_docker.py

2. Development Testing

# Test individual components
python scripts/run_downloader.py --vm-id vm1 --test-mode
python scripts/run_processor.py --vm-id vm5 --test-mode

3. Production Deployment

# Deploy to Render
./deploy-render.sh

# Monitor production
python scripts/monitor_render_deployment.py

Development vs Production

Development (Docker)

  • Use docker-compose.yml
  • Local PostgreSQL in container
  • Shared volumes for file storage
  • Easy debugging and testing

Development (Local)

  • Use STORAGE_BACKEND=local
  • Single VM testing
  • PostgreSQL in Docker container

Production (Render)

  • Use STORAGE_BACKEND=render_disk
  • Full 5-service distributed processing with Celery
  • Managed PostgreSQL and Redis databases
  • Auto-scaling and fault tolerance
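The STORAGE_BACKEND switch described in these sections could be dispatched roughly as follows. The class names and registry are assumptions for illustration; the real implementations live in src/storage/.

```python
import os

# Hypothetical stand-ins for the real storage implementations.
class LocalStorage:
    name = "local"

class RenderDiskStorage:
    name = "render_disk"

class S3Storage:
    name = "s3"

BACKENDS = {
    "local": LocalStorage,
    "render_disk": RenderDiskStorage,
    "s3": S3Storage,
}

def get_storage(env=None):
    """Pick a storage backend from the STORAGE_BACKEND env var.

    Defaults to local storage for development; Render production sets
    STORAGE_BACKEND=render_disk.
    """
    env = os.environ if env is None else env
    backend = env.get("STORAGE_BACKEND", "local")
    if backend not in BACKENDS:
        raise ValueError(f"unsupported STORAGE_BACKEND: {backend!r}")
    return BACKENDS[backend]()

print(get_storage({"STORAGE_BACKEND": "render_disk"}).name)  # render_disk
```

A registry like this keeps the downloader and processor code identical across Docker, local, and Render runs; only the environment changes.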

This producer-consumer architecture ensures reliable data collection with the ability to resume from any failure point.
