TTB COLA Producer-Consumer Scraper

A distributed scraper for TTB COLA (Alcohol and Tobacco Tax and Trade Bureau Certificate of Label Approval) records, using a producer-consumer architecture with four download VMs and one processing VM.

Architecture

  • Producers (VM1-VM4): Download CSV files, detail pages, and certificate pages from TTB
  • Consumer (VM5): Processes stored files and populates the database
  • Storage: Organized file storage with VM-specific folders
  • Coordination: VM status tracking and progress monitoring
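The producer-consumer split above can be sketched with Python's standard threading and queue primitives. This is an illustrative model only: function names like `downloader` and `processor` are hypothetical, and in the real system the hand-off happens through stored files and Celery tasks rather than an in-process queue.

```python
from queue import Queue
from threading import Thread

def downloader(vm_id, pages, out_queue):
    """Producer: pretend to download each page and hand it off.

    In the real system each artifact is written to VM-specific storage
    instead of an in-memory queue.
    """
    for page in pages:
        out_queue.put((vm_id, page))

def processor(in_queue, results, n_items):
    """Consumer: drain stored items and 'populate the database'."""
    for _ in range(n_items):
        vm_id, page = in_queue.get()
        results.append(f"processed {page} from {vm_id}")
        in_queue.task_done()

q = Queue()
results = []
# Four producers (vm1-vm4) and one consumer (vm5), mirroring the layout above.
producers = [
    Thread(target=downloader, args=(f"vm{i}", [f"page-{i}-{j}" for j in range(2)], q))
    for i in range(1, 5)
]
consumer = Thread(target=processor, args=(q, results, 8))
for t in producers:
    t.start()
consumer.start()
for t in producers:
    t.join()
consumer.join()
print(len(results))  # 8
```

Because downloads and processing only communicate through the queue (storage, in production), either side can restart without the other losing state.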

Quick Start

Docker Testing (Recommended for Local Development)

✅ No S3 bucket required! Uses local filesystem storage.

  1. Quick Start:

    # One-command setup and test
    ./test-local.sh
  2. Manual Setup:

    # Start all services
    docker compose up -d
    
    # Wait for initialization
    sleep 30
    
    # Run tests
    python scripts/test_local_docker.py
  3. Monitor System:

    # Flower dashboard
    open http://localhost:5555
    
    # View logs
    docker compose logs -f celery-downloader-1

See DOCKER_TESTING.md for a comprehensive Docker testing guide.

Local Development

  1. Setup Environment:

    cp .env.example .env
    # Edit .env with your database URL
  2. Run Downloader VM:

    python scripts/run_downloader.py --vm-id vm1
  3. Run Processor VM:

    python scripts/run_processor.py --vm-id vm5

Render Deployment (Production)

⚠️ IMPORTANT: Render deployment requires S3 storage - see RENDER_DEPLOYMENT.md for complete setup guide.

  1. Prerequisites:

    • AWS S3 bucket and credentials
    • Render account with GitHub connected
  2. Deploy:

    ./deploy-render.sh
  3. Configure S3 Credentials:

    • Add S3_ACCESS_KEY and S3_SECRET_KEY to each service in Render dashboard
  4. Monitor:

    python scripts/monitor_render_deployment.py

See RENDER_DEPLOYMENT.md for detailed deployment instructions.

Project Structure

├── src/                    # Source code
│   ├── coordination/       # VM coordination and status tracking
│   ├── storage/           # File storage management (local/S3/Render)
│   ├── downloader/        # TTB file downloading
│   ├── processor/         # File processing and database population
│   ├── scraper/           # TTB parsers and data processing
│   ├── db/                # Database models and connection
│   ├── ttb/               # TTB session management
│   └── utils/             # Logging and utilities
├── scripts/               # Execution scripts
├── reference/v1-archive/  # Archived v1 monolithic scraper
├── docker-compose.yml     # Local Docker testing setup
├── DOCKER_TESTING.md      # Docker testing guide
└── NEW_ARCHITECTURE.md    # Detailed architecture documentation

Render Deployment Details

Services Deployed

  • celery-downloader-1: Downloads years 2024, 2020, 2016, 2012, 2008, 2004, 2000, 1996
  • celery-downloader-2: Downloads years 2023, 2019, 2015, 2011, 2007, 2003, 1999, 1995
  • celery-downloader-3: Downloads years 2022, 2018, 2014, 2010, 2006, 2002, 1998
  • celery-downloader-4: Downloads years 2021, 2017, 2013, 2009, 2005, 2001, 1997
  • celery-processor: Processes all downloaded files into the database
  • PostgreSQL: Shared database for coordination and data storage
  • Redis: Message broker for Celery tasks
  • Flower: Monitoring dashboard for Celery tasks
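The year assignments listed above follow a simple round-robin: newest year first, dealt across the four downloaders. A small helper (hypothetical, not the repo's actual scheduler) reproduces the split:

```python
def assign_years(start=2024, end=1995, n_workers=4):
    """Deal years round-robin across downloader workers, newest first.

    Returns {worker_index: [years]}. Newest years go to worker 1 first so
    recent data is collected early. Illustrative helper only.
    """
    assignment = {i: [] for i in range(1, n_workers + 1)}
    for offset, year in enumerate(range(start, end - 1, -1)):
        assignment[offset % n_workers + 1].append(year)
    return assignment

years = assign_years()
print(years[1])  # [2024, 2020, 2016, 2012, 2008, 2004, 2000, 1996]
print(years[3])  # [2022, 2018, 2014, 2010, 2006, 2002, 1998]
```

This matches the service lists above: workers 1 and 2 carry eight years each, workers 3 and 4 carry seven, covering 1995-2024.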

Monitoring Commands

# Check all services status
python scripts/monitor_render_deployment.py

# Monitor via Flower dashboard
open https://your-flower-service.onrender.com

# View logs in Render dashboard
open https://dashboard.render.com

# Access database via Render dashboard
# Use Render's built-in database connection tools

Cost Estimation

  • Monthly: ~$150-200 for 5 services + database
  • One-time data collection: ~$300-500 for complete historical dataset
  • Ongoing: Daily incremental updates ~$50/month

Documentation

  • DOCKER_TESTING.md – Docker testing guide
  • RENDER_DEPLOYMENT.md – Render deployment instructions
  • NEW_ARCHITECTURE.md – detailed architecture documentation

Key Benefits

  • Fault Tolerance: VMs restart independently without losing progress
  • Scalability: Downloads and processing are separated
  • Storage: All files preserved for reprocessing
  • Monitoring: Real-time VM status and progress tracking
  • Render Optimized: Built specifically for Render deployment
  • Docker Testing: Production-like local testing environment

Storage Structure

storage/
├── downloads/vm1/csv/          # VM1 CSV files
├── downloads/vm1/details/      # VM1 detail pages
├── downloads/vm1/certificates/ # VM1 certificate pages
├── downloads/vm2-4/            # Other VMs (similar structure)
├── processing/status/          # Processing status tracking
└── coordination/               # VM coordination data
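A path helper for the layout above might look like the following sketch. The function name and validation are illustrative assumptions; the repo's actual storage module (src/storage/) may structure this differently.

```python
from pathlib import Path

def download_path(root, vm_id, file_type, filename):
    """Build a VM-scoped download path mirroring the storage tree above.

    file_type must be one of 'csv', 'details', or 'certificates'.
    Hypothetical helper for illustration.
    """
    allowed = {"csv", "details", "certificates"}
    if file_type not in allowed:
        raise ValueError(f"unknown file type: {file_type!r}")
    return Path(root) / "downloads" / vm_id / file_type / filename

p = download_path("storage", "vm1", "csv", "2024.csv")
print(p.as_posix())  # storage/downloads/vm1/csv/2024.csv
```

Keeping every path VM-scoped is what lets the downloaders write concurrently without coordinating on filenames.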

Development Workflow

1. Local Testing (Docker)

# Test locally with Docker
docker compose up -d
python scripts/test_local_docker.py

2. Development Testing

# Test individual components
python scripts/run_downloader.py --vm-id vm1 --test-mode
python scripts/run_processor.py --vm-id vm5 --test-mode

3. Production Deployment

# Deploy to Render
./deploy-render.sh

# Monitor production
python scripts/monitor_render_deployment.py

Development vs Production

Development (Docker)

  • Use docker-compose.yml
  • Local PostgreSQL in container
  • Shared volumes for file storage
  • Easy debugging and testing

Development (Local)

  • Use STORAGE_BACKEND=local
  • Single VM testing
  • PostgreSQL in Docker container

Production (Render)

  • Use STORAGE_BACKEND=render_disk
  • Full 5-service distributed processing with Celery
  • Managed PostgreSQL and Redis databases
  • Auto-scaling and fault tolerance
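The STORAGE_BACKEND switch described in these sections could be dispatched roughly as follows. The class names and registry are assumptions for illustration; the real implementations live in src/storage/.

```python
import os

# Hypothetical stand-ins for the real storage implementations.
class LocalStorage:
    name = "local"

class RenderDiskStorage:
    name = "render_disk"

class S3Storage:
    name = "s3"

BACKENDS = {
    "local": LocalStorage,
    "render_disk": RenderDiskStorage,
    "s3": S3Storage,
}

def get_storage(env=None):
    """Pick a storage backend from the STORAGE_BACKEND env var.

    Defaults to local storage for development; Render production sets
    STORAGE_BACKEND=render_disk.
    """
    env = os.environ if env is None else env
    backend = env.get("STORAGE_BACKEND", "local")
    if backend not in BACKENDS:
        raise ValueError(f"unsupported STORAGE_BACKEND: {backend!r}")
    return BACKENDS[backend]()

print(get_storage({"STORAGE_BACKEND": "render_disk"}).name)  # render_disk
```

A registry like this keeps the downloader and processor code identical across Docker, local, and Render runs; only the environment changes.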

This producer-consumer architecture ensures reliable data collection with the ability to resume from any failure point.
