This document describes all Python scripts in the WebDataScraper project.
Primary script for populating the database with curated Canadian credit card data.
- Uploads 34 cards from `data/curated_cards.json`
- Includes category rewards and signup bonuses
- Usage: `python seed_cards.py`
Orchestrates the complete scraping workflow with duplicate prevention.
- Scrapes from multiple sources
- Validates and deduplicates data
- Uploads to Supabase
- Usage: `python scrape_workflow.py`
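
The validate-and-deduplicate step described above can be sketched as follows. This is a minimal illustration, not the code in `scrape_workflow.py`: the field names `issuer` and `name`, the helper names, and the sample records are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Iterable


def normalize_key(card: dict) -> tuple[str, str]:
    """Build a case- and whitespace-insensitive identity key (hypothetical fields)."""
    issuer = " ".join(str(card.get("issuer", "")).lower().split())
    name = " ".join(str(card.get("name", "")).lower().split())
    return issuer, name


def is_valid(card: dict) -> bool:
    """Treat a record as valid only if both issuer and name are non-empty."""
    issuer, name = normalize_key(card)
    return bool(issuer) and bool(name)


def validate_and_deduplicate(cards: Iterable[dict]) -> list[dict]:
    """Drop invalid records, then keep the first record seen for each key."""
    seen: set[tuple[str, str]] = set()
    unique: list[dict] = []
    for card in cards:
        if not is_valid(card):
            continue
        key = normalize_key(card)
        if key in seen:
            continue
        seen.add(key)
        unique.append(card)
    return unique


# Made-up sample data, as if collected from several sources.
scraped = [
    {"issuer": "Example Bank", "name": "Cash Back Card"},
    {"issuer": "example bank", "name": "Cash  Back Card"},  # same card, different spacing/case
    {"issuer": "Example Bank", "name": ""},                 # invalid: missing name
    {"issuer": "Other Bank", "name": "Travel Card"},
]
cleaned = validate_and_deduplicate(scraped)
print([card["name"] for card in cleaned])  # ['Cash Back Card', 'Travel Card']
```

Only the cleaned list would then be passed on to the upload step, so repeated runs do not insert the same card twice.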
Core scraper module with parsing utilities for credit card data.
Handles all Supabase database operations for cards, rewards, and bonuses.
- `check_duplicates.py` - Quickly check for duplicate cards in the database
- `review_duplicates.py` - Generate detailed duplicate analysis reports
- `monitor_duplicates.py` - Continuous monitoring of duplicate entries
- `deduplicate_cards.py` - Remove duplicate cards (basic)
- `advanced_deduplicate.py` - Advanced deduplication with ML-based matching
- `bulk_deduplicate.py` - Batch deduplication operations
- `cleanup_duplicates.py` - Remove cards without category rewards
- `show_duplicates_report.py` - Display duplicate statistics
- `simple_duplicate_prevention.py` - Basic duplicate prevention utilities
- `data_merger.py` - Merge card data from multiple sources
- `card_matcher.py` - Match and compare card records using fuzzy matching
- `program_matcher.py` - Match and normalize reward program names
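
To illustrate what fuzzy matching of card records can look like, here is a small sketch built on the standard library's `difflib.SequenceMatcher`. It is not the algorithm used by `card_matcher.py`; the function names, the 0.85 threshold, and the sample card names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio after lowercasing and collapsing whitespace."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same card when similarity reaches the (assumed) threshold."""
    return similarity(a, b) >= threshold


# Made-up names: a spelling variant of one card, and an unrelated card.
same_card = is_probable_match("Example Bank Cash Back Visa", "Example Bank Cashback Visa")
other_card = is_probable_match("Example Bank Cash Back Visa", "Other Bank Travel Card")
print(same_card, other_card)  # True False
```

The threshold is a trade-off: a lower value merges more spelling variants but raises the risk of merging distinct cards, so any real value should be tuned against known duplicates.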
- `card_identity_manager.py` - Manage unique card identities and deduplication
- `card_master_manager.py` - Maintain master card records
Base scraper class with HTML parsing utilities for extracting card data.
Focused scraper for specific credit cards or issuers.
Combined scraping and uploading in a single workflow.
Standalone uploader for manually prepared card data.
- `config.py` - Centralized configuration management (loads from `.env`)
- `logger_config.py` - Logging setup with file and console handlers
- `rate_limiter.py` - Adaptive rate limiting with domain tracking
- `retry_util.py` - Retry decorators with exponential backoff
- `error_handler.py` - Custom exceptions and error tracking
- `supabase_client.py` - Generic Supabase client wrapper
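
A retry decorator with exponential backoff, as mentioned for `retry_util.py`, might look roughly like the sketch below. The decorator name, its parameters, and the `flaky_fetch` function are hypothetical; the real module may expose a different interface. The `sleep` parameter is injectable only so the example runs instantly.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the delay after each failed attempt."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


# Demo: record the delays instead of sleeping, and fail twice before succeeding.
delays = []
calls = {"count": 0}


@retry(max_attempts=4, base_delay=0.5, exceptions=(ConnectionError,), sleep=delays.append)
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_fetch()
print(result, delays)  # ok [0.5, 1.0]
```

Restricting `exceptions` to transient errors keeps the decorator from retrying genuine bugs such as a `KeyError` in parsing code.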
- `reward_programs.py` - Manage reward program taxonomy
- `seed_reward_programs.py` - Populate reward programs table
- `backfill_card_programs.py` - Backfill missing reward program associations
- `apply_migration.py` - Show SQL migration instructions
- `apply_migrations_simple.py` - Simple migration application guide
- `run_migration.py` - Migration runner utility
- `schema.sql` - Database schema definition
- `combined_migrations.sql` - Combined migration scripts
- `reset_to_seed_data.py` - Reset database to seed data state
- `clear_all_supabase_data.py` - ⚠️ DANGER: Clear all data from Supabase tables
- `check_tables.py` - Verify database table structure and contents
- `README.md` - Main project documentation
- `SCRIPTS.md` - This file (script reference)
- `PROJECT_ANALYSIS.md` - Detailed project analysis and architecture
- `Claude.md` - Claude AI assistant context and prompts
- `docs/` - Additional documentation
  - `TASK_REWARD_TAXONOMY.md` - Reward taxonomy implementation
  - `IMPLEMENTATION_SUMMARY.md` - Feature implementation details
- `tests/` - Unit tests with pytest
  - `test_scraper.py` - Scraper functionality tests
  - `test_data_merger.py` - Data merging tests
  - `test_card_matcher.py` - Card matching tests
  - `test_duplicate_prevention.py` - Duplicate detection tests
  - `test_reward_taxonomy.py` - Reward taxonomy tests

Run tests: `pytest tests/ -v`
# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure environment
cp .env.example .env
# Edit .env with your Supabase credentials
# 3. Seed database
python seed_cards.py

# Full workflow (recommended)
python scrape_workflow.py
# Or manually
python credit_card_scraper.py
python upload_cards.py

# Check for duplicates
python check_duplicates.py
# Review and clean
python review_duplicates.py
python cleanup_duplicates.py
# Reset if needed
python reset_to_seed_data.py

These scripts can delete data. Use with caution:
- `clear_all_supabase_data.py` - Deletes ALL data from Supabase
- `cleanup_duplicates.py` - Removes cards without category rewards
- `deduplicate_cards.py` - Removes duplicate cards
Always backup your database before running destructive operations.
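
One simple way to keep a safety copy is to dump the rows of each table to a JSON file before running a destructive script. The sketch below shows only the file-handling side, under stated assumptions: the `cards` table name, the helper names, and the sample rows are hypothetical, and fetching the rows from Supabase is left out. It does not replace a proper database-level backup.

```python
import json
import tempfile
from pathlib import Path


def backup_rows(table_name: str, rows: list, backup_dir: Path) -> Path:
    """Write rows to <backup_dir>/<table_name>.json and return the file path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    path = backup_dir / f"{table_name}.json"
    path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return path


def load_backup(path: Path) -> list:
    """Read a backup file written by backup_rows."""
    return json.loads(path.read_text(encoding="utf-8"))


# Made-up rows standing in for data fetched from the database.
rows = [{"id": 1, "name": "Cash Back Card"}, {"id": 2, "name": "Travel Card"}]

with tempfile.TemporaryDirectory() as tmp:
    backup_path = backup_rows("cards", rows, Path(tmp) / "backups")
    restored = load_backup(backup_path)

print(restored == rows)  # True
```

Reading the file back immediately, as done here, is a cheap check that the backup is usable before any data is deleted.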
| Category | Count | Key Scripts |
|---|---|---|
| Main | 4 | seed_cards, scrape_workflow, scraper, uploader |
| Duplicate Management | 9 | check, review, monitor, deduplicate variants |
| Data Operations | 3 | data_merger, card_matcher, program_matcher |
| Scraping | 3 | credit_card_scraper, targeted_scraper, scrape_and_upload |
| Configuration | 6 | config, logger, rate_limiter, retry_util, error_handler |
| Database | 8 | migrations, schema, seed programs, backfill |
| Utilities | 4 | check_tables, reset, clear data |
Total: 37 Python scripts
For more details on any script, run `python <script_name>.py --help` or read the docstrings in the source code.