# 🍁 IRCC Automated Data Pipeline & Scraper

An automated Python data pipeline that fetches, filters, and formats real-time policy updates from the Immigration, Refugees and Citizenship Canada (IRCC) official API.

## 📖 Overview

Keeping track of rapidly changing Canadian immigration policies (like Express Entry category-based draws) can be tedious. This project automates the extraction of IRCC news releases and official notices, filters them for high-priority keywords, and exports the data into structured formats (CSV and JSON) for easy analysis.

## ✨ Key Features

- **Targeted Extraction:** Filters news feeds for specific 2026 priority categories (e.g., "Doctor", "Express Entry", "Transport").
- **WAF Avoidance:** Avoids aggressive Web Application Firewall (WAF) blocks and timeouts by querying the official IRCC Atom/XML feed instead of scraping brittle front-end HTML.
- **Data Persistence:** Automatically sanitizes and exports data into structured `ircc_updates.csv` and `ircc_updates.json` files.
- **Fully Automated:** Configured to run silently every morning via Windows Task Scheduler so the data is always up to date.
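
The export step can be sketched roughly as follows. This is a minimal illustration, not the actual code from `Web_scraper_IRCC.py`; the `export_updates` function and the `title`/`link`/`published` field names are assumptions for the example.

```python
import csv
import json

def export_updates(updates, csv_path="ircc_updates.csv", json_path="ircc_updates.json"):
    """Write a list of update dicts to CSV and JSON (illustrative sketch).

    `updates` is assumed to be a list of dicts with the keys below;
    the real script's schema may differ.
    """
    fieldnames = ["title", "link", "published"]  # assumed column layout
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(updates)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(updates, f, ensure_ascii=False, indent=2)

# Example usage with a single hypothetical record
updates = [{"title": "Express Entry draw", "link": "https://example.org", "published": "2026-01-01"}]
export_updates(updates)
```

Writing both formats from the same in-memory list keeps the CSV and JSON outputs guaranteed to agree with each other.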

## 🛠️ Tech Stack

- **Language:** Python 3.x
- **Libraries:** `requests`, `beautifulsoup4` (with `lxml` for XML parsing), `csv`, `json`
- **Automation:** Windows Task Scheduler

## 🚧 Technical Challenges Overcome

**The Problem:** Initial attempts to scrape the canada.ca front end resulted in frequent `ReadTimeout` errors; the government servers actively block requests that carry the default python-requests headers.

**The Solution:** Instead of spoofing headers or running slow Selenium instances, I pivoted to reverse-engineering the data source and located the official backend Atom API feed. By passing the API response through BeautifulSoup's XML parser, the script now runs roughly 10x faster, uses fewer resources, and has a 100% success rate without triggering anti-bot protections.
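
The parsing side of that approach can be sketched with the standard library alone (the project itself uses BeautifulSoup, and the actual IRCC feed URL is not shown here; the sample feed below is a hypothetical Atom snippet):

```python
import xml.etree.ElementTree as ET

# Atom documents live in this XML namespace
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_atom_entries(xml_text):
    """Extract title/link/updated from each <entry> in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link_el = entry.find("atom:link", ATOM_NS)
        link = link_el.get("href") if link_el is not None else ""
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        entries.append({"title": title, "link": link, "updated": updated})
    return entries

# Hypothetical sample feed, standing in for the live IRCC response
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Express Entry: category-based rounds</title>
    <link href="https://www.canada.ca/example"/>
    <updated>2026-01-15T09:00:00Z</updated>
  </entry>
</feed>"""

entries = parse_atom_entries(sample)
```

Because the Atom feed is a stable, machine-readable contract rather than rendered HTML, a namespace-aware parser like this (or BeautifulSoup's XML mode) is far less brittle than CSS-selector scraping.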

## 🚀 How to Run Locally

1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/ircc-scraper.git
   ```

2. Install the required dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Run the script:

   ```shell
   python Web_scraper_IRCC.py
   ```

4. Check your project folder for the newly generated `ircc_updates.csv` and `ircc_updates.json` files.

## ⚙️ Customization: Tracking Other Streams

The scraper is designed to be easily configurable. You can change which "streams" or immigration categories the script tracks by modifying the target_keywords list in the code.

To track different news, simply update this line in `Web_scraper_IRCC.py`:

```python
# Change these to any keywords you want to track (e.g., "Visa", "Francophone", "Citizenship")
target_keywords = ["Doctor", "Physician", "Express Entry", "Transport", "Research"]
```
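
A keyword filter of this kind might look like the sketch below. The `matches_keywords` helper and the sample titles are illustrative, not taken from the actual script; a case-insensitive substring check is one simple way to implement it.

```python
# Keywords to track, as in the project's configuration line
target_keywords = ["Doctor", "Physician", "Express Entry", "Transport", "Research"]

def matches_keywords(title, keywords=target_keywords):
    """Return True if any tracked keyword appears in the title (case-insensitive)."""
    lowered = title.lower()
    return any(kw.lower() in lowered for kw in keywords)

# Hypothetical headlines standing in for feed entries
titles = [
    "Express Entry: invitations issued to transport occupations",
    "Minister announces new passport design",
]
matched = [t for t in titles if matches_keywords(t)]
```

Lower-casing both sides means "doctor", "Doctor", and "DOCTOR" all match, so casing changes in IRCC headlines won't silently drop entries.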