# 🍁 IRCC Automated Data Pipeline & Scraper

An automated Python data pipeline that fetches, filters, and formats real-time policy updates from the Immigration, Refugees and Citizenship Canada (IRCC) official API.

## 📖 Overview

Keeping track of rapidly changing Canadian immigration policies (like Express Entry category-based draws) can be tedious. This project automates the extraction of IRCC news releases and official notices, filters them for high-priority keywords, and exports the data into structured formats (CSV and JSON) for easy analysis.

## ✨ Key Features

- **Targeted Extraction:** Filters news feeds for specific 2026 priority categories (e.g., "Doctor", "Express Entry", "Transport").
- **WAF Avoidance:** Avoids aggressive Web Application Firewall (WAF) blocks and timeouts by querying the official IRCC Atom/XML feed instead of scraping brittle front-end HTML.
- **Data Persistence:** Automatically sanitizes and exports data into structured `ircc_updates.csv` and `ircc_updates.json` files.
- **Fully Automated:** Configured to run silently every morning via Windows Task Scheduler so the data is always up to date.
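
The export step can be sketched roughly as follows. This is a minimal illustration, not the actual code from `Web_scraper_IRCC.py`; the `export_updates` function and the `title`/`link`/`published` field names are assumptions for the example.

```python
import csv
import json

def export_updates(updates, csv_path="ircc_updates.csv", json_path="ircc_updates.json"):
    """Write a list of update dicts to CSV and JSON (illustrative sketch).

    `updates` is assumed to be a list of dicts with the keys below;
    the real script's schema may differ.
    """
    fieldnames = ["title", "link", "published"]  # assumed column layout
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(updates)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(updates, f, ensure_ascii=False, indent=2)

# Example usage with a single hypothetical record
updates = [{"title": "Express Entry draw", "link": "https://example.org", "published": "2026-01-01"}]
export_updates(updates)
```

Writing both formats from the same in-memory list keeps the CSV and JSON outputs guaranteed to agree with each other.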

## 🛠️ Tech Stack

- **Language:** Python 3.x
- **Libraries:** `requests`, `beautifulsoup4` (with `lxml` for XML parsing), `csv`, `json`
- **Automation:** Windows Task Scheduler

## 🚧 Technical Challenges Overcome

**The Problem:** Initial attempts to scrape the canada.ca front end resulted in frequent `ReadTimeout` errors; the government servers actively block requests that carry the default python-requests headers.

**The Solution:** Instead of spoofing headers or running slow Selenium instances, I pivoted to reverse-engineering the data source and located the official backend Atom API feed. By passing the API response through BeautifulSoup's XML parser, the script now runs roughly 10x faster, uses fewer resources, and has a 100% success rate without triggering anti-bot protections.
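
The parsing side of that approach can be sketched with the standard library alone (the project itself uses BeautifulSoup, and the actual IRCC feed URL is not shown here; the sample feed below is a hypothetical Atom snippet):

```python
import xml.etree.ElementTree as ET

# Atom documents live in this XML namespace
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_atom_entries(xml_text):
    """Extract title/link/updated from each <entry> in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link_el = entry.find("atom:link", ATOM_NS)
        link = link_el.get("href") if link_el is not None else ""
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        entries.append({"title": title, "link": link, "updated": updated})
    return entries

# Hypothetical sample feed, standing in for the live IRCC response
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Express Entry: category-based rounds</title>
    <link href="https://www.canada.ca/example"/>
    <updated>2026-01-15T09:00:00Z</updated>
  </entry>
</feed>"""

entries = parse_atom_entries(sample)
```

Because the Atom feed is a stable, machine-readable contract rather than rendered HTML, a namespace-aware parser like this (or BeautifulSoup's XML mode) is far less brittle than CSS-selector scraping.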

## 🚀 How to Run Locally

1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/ircc-scraper.git
   ```

2. Install the required dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Run the script:

   ```shell
   python Web_scraper_IRCC.py
   ```

4. Check your project folder for the newly generated `ircc_updates.csv` and `ircc_updates.json` files.

## ⚙️ Customization: Tracking Other Streams

The scraper is designed to be easily configurable. You can change which "streams" or immigration categories the script tracks by modifying the target_keywords list in the code.

To track different news, simply update this line in `Web_scraper_IRCC.py`:

```python
# Change these to any keywords you want to track (e.g., "Visa", "Francophone", "Citizenship")
target_keywords = ["Doctor", "Physician", "Express Entry", "Transport", "Research"]
```
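
A keyword filter of this kind might look like the sketch below. The `matches_keywords` helper and the sample titles are illustrative, not taken from the actual script; a case-insensitive substring check is one simple way to implement it.

```python
# Keywords to track, as in the project's configuration line
target_keywords = ["Doctor", "Physician", "Express Entry", "Transport", "Research"]

def matches_keywords(title, keywords=target_keywords):
    """Return True if any tracked keyword appears in the title (case-insensitive)."""
    lowered = title.lower()
    return any(kw.lower() in lowered for kw in keywords)

# Hypothetical headlines standing in for feed entries
titles = [
    "Express Entry: invitations issued to transport occupations",
    "Minister announces new passport design",
]
matched = [t for t in titles if matches_keywords(t)]
```

Lower-casing both sides means "doctor", "Doctor", and "DOCTOR" all match, so casing changes in IRCC headlines won't silently drop entries.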