Avature Data Scraper - Enterprise Grade

This project is a robust and modular solution for extracting job data from multiple Avature-based career portals.

The primary focus of this implementation was not just "raw extraction," but data quality, resilience against failures, and politeness (anti-ban measures).


🏁 Project Evaluation & Metrics

This project was developed and measured against three core pillars of enterprise scraping.

1. Coverage (Primary Metric)

  • Total Portals Discovered: 514 unique Avature-based career sites.
  • Total Jobs Scraped: 6,684 unique job listings.
  • Reliability: By solving the "Lying Pagination" problem (see below), we achieved near 100% coverage on complex portals like Bloomberg and L'Oréal that typically fail with standard scraping techniques.

2. Engineering Logic

  • API Reverse Engineering: Instead of just scraping HTML, we reverse-engineered Avature's internal /SearchJobs API. We built an automated "Endpoint Fuzzer" that detects the correct API path for any given subdomain.
  • Smart Offset Mechanism: We solved the "Lying Pagination" problem (where Avature returns duplicate items in HTML) by implementing a dynamic offset. It increments based on unique items found, ensuring no jobs are skipped and no duplicates are processed.
  • Modular Pipeline: A 3-stage architecture (Discovery → Extraction → Enrichment) that decouples domain finding from data extraction, allowing for independent scaling and debugging.
  • Resilience First: We prioritized stealth and persistence (High Jitter, Low Concurrency, Incremental Saving) over raw speed. This ensured the pipeline could process 500+ domains without a single IP ban or session timeout.

3. Attention to Detail & Quality Control

  • Data Cleanliness: We implemented strict "Anti-Social" filters to remove LinkedIn, Facebook, and "Share" links that frequently pollute Avature job lists.
  • Transparent Iteration: While the current dataset is highly refined, we acknowledge that some "residual noise" (non-job items) might still persist in edge-case portals.
    • Known Edge Case: The portal manpowergroupco.avature.net currently requires a custom parser adjustment. It presents non-job informational blocks with the same HTML structure as valid job listings, occasionally leading to false positives in the dataset. This is an identified target for the next optimization cycle.
  • Schema Validation: Used Pydantic models to ensure every job entry adheres to a strict schema before being saved to the final output.
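The schema-validation step can be illustrated with a minimal stand-in. The project uses Pydantic models; the sketch below expresses the same contract with a stdlib dataclass so it is dependency-free, and the field names (`title`, `url`, `location`) are assumptions about the real schema.

```python
# Illustrative stand-in for the project's Pydantic schema: a record
# is rejected before saving unless it satisfies the contract.
from dataclasses import dataclass

@dataclass
class JobPosting:
    title: str
    url: str
    location: str = ""

    def __post_init__(self):
        # Reject empty titles and non-HTTPS URLs at construction time,
        # mirroring what a Pydantic validator would enforce.
        if not self.title.strip():
            raise ValueError("title must be non-empty")
        if not self.url.startswith("https://"):
            raise ValueError(f"invalid job URL: {self.url!r}")
```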

🧠 Engineering & Architectural Decisions

Below are the technical decisions made to resolve specific challenges encountered during the reverse engineering of Avature portals.

1. The "Lying" Pagination Problem (Smart Offset)

Observation: Avature's internal API often returns HTML containing duplicate job listings on the same page (e.g., hidden desktop and mobile versions), or returns fewer items than the requested limit.

Common Failure: Fixed offset increments (e.g., offset += 50) caused the scraper to skip real jobs when the API returned only 20 unique items masked within 42 HTML elements.

Our Solution: We implemented a Uniqueness-Based Dynamic Offset. The scraper counts how many unique, valid URLs were extracted from the current page and increments the offset by exactly that number.

Result: Increased job coverage from ~50% to 100% on complex portals like L'Oréal and Bloomberg.
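The pagination loop above can be sketched as follows. This is a minimal reconstruction: `fetch_page(offset, limit)` stands in for the real API call and is assumed to return the raw job URLs found in that page's HTML, duplicates included.

```python
# Uniqueness-Based Dynamic Offset (sketch): advance by the number of
# *unique* items found on each page, not by a fixed page size.
def scrape_all(fetch_page, limit: int = 50) -> list[str]:
    seen: set[str] = set()
    results: list[str] = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break  # API exhausted
        unique = []
        for url in page:
            if url not in seen:  # drops both in-page and cross-page dupes
                seen.add(url)
                unique.append(url)
        if not unique:
            break  # only duplicates left: stop instead of looping forever
        results.extend(unique)
        # A fixed `offset += limit` would skip the jobs hidden behind
        # the duplicate HTML rows; counting unique items never does.
        offset += len(unique)
    return results
```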

2. "Anti-Social" Filters (Data Quality)

Observation: Many portals insert "Share on LinkedIn" or "Facebook" buttons within the job list using the same HTML structure as a real job listing.

Common Failure: Extracting "Share" as if it were a job title.

Our Solution: Implementation of Strict Rejection Filters in the extraction layer (extractor.py). Any link or title containing keywords like share, linkedinApi, facebook, or whatsapp is discarded before entering the pipeline.
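A minimal version of such a rejection filter might look like this. The keyword list mirrors the ones named above; the function name is an assumption, not extractor.py's actual API.

```python
# Strict rejection filter (sketch): drop social/share widgets that
# mimic job-listing markup before they enter the pipeline.
REJECT_KEYWORDS = ("share", "linkedinapi", "facebook", "whatsapp")

def is_real_job_link(title: str, url: str) -> bool:
    """Return False for any link/title matching a blocklisted keyword."""
    haystack = f"{title} {url}".lower()
    return not any(keyword in haystack for keyword in REJECT_KEYWORDS)
```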

3. "Waterfall" Enrichment Strategy

To ensure the highest possible data quality for job details, we used a cascading parsing strategy (enricher.py):

  1. Gold (JSON-LD): Top priority. Extracts schema.org structured data (invisible to the user, but perfect for machines).
  2. Silver (Semantic HTML): If Gold fails, it uses selectors specific to Avature's structure (BEM CSS).
  3. Bronze (Fallback): Last resort, searches for generic classes (job-description, content).
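The cascade can be sketched as a simple tier loop. This is illustrative only: the Gold tier below uses a simplified string search for the JSON-LD block (the real code would go through an HTML parser), and the Silver/Bronze tiers are injected as callables.

```python
import json

def parse_json_ld(html: str):
    """Gold tier (sketch): find a schema.org JobPosting JSON-LD block."""
    marker = '<script type="application/ld+json">'
    start = html.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = html.find("</script>", start)
    try:
        data = json.loads(html[start:end])
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and data.get("@type") == "JobPosting":
        return data
    return None

def enrich(html: str, silver, bronze):
    """Waterfall: Gold (JSON-LD) -> Silver (semantic HTML) -> Bronze."""
    for tier in (parse_json_ld, silver, bronze):
        result = tier(html)
        if result:
            return result  # first tier that yields data wins
    return None
```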

4. Politeness & Resilience (Stealth Mode)

To avoid blocking (429/403 errors) and ensure long-running execution:

  • High Jitter: Random pauses of 2 to 5 seconds between requests.
  • Low Concurrency: Limited to 2 simultaneous requests (simulating human navigation).
  • Incremental Saving: Discovery and Enrichment stages save data line-by-line. If the process is interrupted, progress is preserved.
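Combining the jitter and concurrency limits, the politeness layer might look like the sketch below. The 2–5 s window matches the values above; the function names are assumptions, and the jitter range is parameterized so the sketch stays testable.

```python
import asyncio
import random

async def polite_fetch(url, fetcher, semaphore, jitter=(2.0, 5.0)):
    """Fetch one URL under the concurrency cap, after a random pause."""
    async with semaphore:                             # low concurrency
        await asyncio.sleep(random.uniform(*jitter))  # high jitter
        return await fetcher(url)

async def crawl(urls, fetcher, max_concurrency=2, jitter=(2.0, 5.0)):
    """Run all fetches, at most `max_concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [polite_fetch(u, fetcher, semaphore, jitter) for u in urls]
    return await asyncio.gather(*tasks)  # results in input order
```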

🛠️ Modular Architecture

The project was divided into 3 independent stages to facilitate scalability and debugging:

  1. Discovery (run_discovery.py)

    • Uses Multi-Source Intelligence (DuckDuckGo + Passive Sources like crt.sh and AlienVault) to find *.avature.net subdomains.
    • Maintains an incremental database in data/outputs/discovered_urls.json.
  2. Extraction (run_extraction.py)

    • Reads the discovered domains.
    • Performs reverse engineering of the /SearchJobs API (automatically detecting endpoint variations).
    • Generates the raw dataset (jobs.json) with high pagination reliability.
  3. Enrichment (run_enrichment.py)

    • Visits each job URL individually.
    • Extracts full description, clean HTML, and normalized metadata.
    • Saves in JSONL format (jobs_enriched.jsonl), ideal for Big Data ingestion.

🚀 How to Run

Prerequisites

  • Python 3.10+
  • Dependencies installed: pip install -r requirements.txt

1. Automated Execution (Full Pipeline)

To run all stages sequentially:

python main.py

2. Manual Execution (Step-by-Step)

Step A: Portal Discovery

python run_discovery.py

Output: data/outputs/discovered_urls.json

Step B: List Extraction

python run_extraction.py

Output: data/outputs/jobs.json

Step C: Data Enrichment

python run_enrichment.py

Output: data/outputs/jobs_enriched.jsonl


⚙️ Configuration

The src/config.py file controls the bot's behavior. The current default is Ultra-Conservative mode:

REQUEST_TIMEOUT = 60        # High tolerance for slow servers
MAX_CONCURRENT_REQUESTS = 2 # Low concurrency (avoids detection)
RATE_LIMIT_DELAY = 3.0      # Base pause
MAX_RETRIES = 7             # Persistence in case of errors

To increase speed (in controlled environments), increase MAX_CONCURRENT_REQUESTS and decrease RATE_LIMIT_DELAY.


📊 Tech Stack

  • aiohttp: High-performance asynchronous HTTP requests.
  • BeautifulSoup4 (lxml): Robust and fault-tolerant HTML parsing.
  • Pydantic: Data validation and strong typing.
  • Loguru: Structured and readable logging.
  • Ruff: Linter and formatter to ensure code quality (PEP 8).
