This project is a robust and modular solution for extracting job data from multiple Avature-based career portals.
The primary focus of this implementation was not just "raw extraction," but data quality, resilience against failures, and politeness (anti-ban measures).
This project was developed and measured against three core pillars of enterprise scraping.
- Total Portals Discovered: 514 unique Avature-based career sites.
- Total Jobs Scraped: 6,684 unique job listings.
- Reliability: By solving the "Lying Pagination" problem (see below), we achieved near 100% coverage on complex portals like Bloomberg and L'Oréal that typically fail with standard scraping techniques.
- API Reverse Engineering: Instead of just scraping HTML, we reverse-engineered Avature's internal `/SearchJobs` API. We built an automated "Endpoint Fuzzer" that detects the correct API path for any given subdomain.
- Smart Offset Mechanism: We solved the "Lying Pagination" problem (where Avature returns duplicate items in HTML) by implementing a dynamic offset. It increments based on unique items found, ensuring no jobs are skipped and no duplicates are processed.
- Modular Pipeline: A 3-stage architecture (Discovery → Extraction → Enrichment) that decouples domain finding from data extraction, allowing for independent scaling and debugging.
- Resilience First: We prioritized stealth and persistence (High Jitter, Low Concurrency, Incremental Saving) over raw speed. This ensured the pipeline could process 500+ domains without a single IP ban or session timeout.
- Data Cleanliness: We implemented strict "Anti-Social" filters to remove LinkedIn, Facebook, and "Share" links that frequently pollute Avature job lists.
- Transparent Iteration: While the current dataset is highly refined, we acknowledge that some "residual noise" (non-job items) might still persist in edge-case portals.
- Known Edge Case: The portal `manpowergroupco.avature.net` currently requires a custom parser adjustment. It presents non-job informational blocks with the same HTML structure as valid job listings, occasionally leading to false positives in the dataset. This is an identified target for the next optimization cycle.
- Schema Validation: Used Pydantic models to ensure every job entry adheres to a strict schema before being saved to the final output.
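As an illustration of the schema validation step, the sketch below shows what such a Pydantic model could look like. The field names (`title`, `url`, `location`, `portal`) and the example record are assumptions for illustration, not the project's actual models.

```python
# Minimal sketch of a job schema (Pydantic v2 syntax assumed); the field names
# are illustrative and may differ from the project's actual models.
from pydantic import BaseModel, HttpUrl


class JobPosting(BaseModel):
    title: str
    url: HttpUrl
    location: str | None = None
    portal: str  # e.g. "loreal.avature.net"


# Validation happens at construction time: a missing title or malformed URL
# raises a ValidationError before the record reaches the output file.
job = JobPosting(
    title="Data Engineer",
    url="https://example.avature.net/careers/JobDetail/123",
    portal="example.avature.net",
)
```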
Below are the technical decisions made to resolve specific challenges encountered during the reverse engineering of Avature portals.
Observation: Avature's internal API often returns HTML containing duplicate job listings on the same page (e.g., hidden desktop and mobile versions) or returns fewer items than the requested limit.
Common Failure: Fixed offset increments (e.g., offset += 50) caused the scraper to "skip" real jobs when the API returned only 20 unique items masked within 42 HTML elements.
Our Solution: We implemented a Uniqueness-Based Dynamic Offset. The scraper counts how many unique and valid URLs were extracted from the current page and increments the offset by exactly that number.
Result: Increased job coverage from ~50% to 100% on complex portals like L'Oréal and Bloomberg.
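A minimal sketch of the uniqueness-based offset described above (function and variable names are illustrative, not the project's actual code):

```python
# Sketch of a uniqueness-based dynamic offset. The offset advances by the
# number of *unique* job URLs found on a page, never by a fixed page size.
from typing import Callable


def paginate(
    fetch_page: Callable[[int, int], str],        # (offset, limit) -> raw HTML
    extract_job_urls: Callable[[str], list[str]],  # HTML -> list of job URLs
    page_size: int = 50,
) -> set[str]:
    seen: set[str] = set()
    offset = 0
    while True:
        html = fetch_page(offset, page_size)
        urls = extract_job_urls(html)          # may contain duplicates (desktop + mobile markup)
        new_urls = [u for u in urls if u not in seen]
        if not new_urls:
            break                              # nothing new on this page -> end of results
        seen.update(new_urls)
        offset += len(new_urls)                # advance only by what was actually unique
    return seen
```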
Observation: Many portals insert "Share on LinkedIn" or "Facebook" buttons within the job list using the same HTML structure as a real job listing.
Common Failure: Extracting "Share" as if it were a job title.
Our Solution: Implementation of Strict Rejection Filters in the extraction layer (`extractor.py`). Any link or title containing keywords like `share`, `linkedinApi`, `facebook`, or `whatsapp` is discarded before entering the pipeline.
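A hedged sketch of this kind of filter (the exact keyword list and function name are illustrative, not copied from `extractor.py`):

```python
# Illustrative sketch of a strict rejection filter; the real keyword list lives
# in extractor.py and may differ from the one shown here.
REJECT_KEYWORDS = ("share", "linkedinapi", "facebook", "whatsapp")


def is_real_job_link(title: str, url: str) -> bool:
    """Return False for social/share widgets masquerading as job listings."""
    haystack = f"{title} {url}".lower()
    return not any(keyword in haystack for keyword in REJECT_KEYWORDS)
```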
To ensure the highest possible data quality for job details, we used a cascading parsing strategy (`enricher.py`), sketched after the list below:
- Gold (JSON-LD): Top priority. Extracts `schema.org` structured data (invisible to the user, but perfect for machines).
- Silver (Semantic HTML): If Gold fails, it uses selectors specific to Avature's structure (BEM CSS).
- Bronze (Fallback): Last resort, searches for generic classes (`job-description`, `content`).
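The sketch below illustrates the Gold → Silver → Bronze cascade with BeautifulSoup; the CSS selectors are placeholders, not the exact ones used in `enricher.py`.

```python
# Illustrative sketch of the cascading parsing strategy; selectors are
# placeholders, not the exact ones used in enricher.py.
import json

from bs4 import BeautifulSoup


def extract_description(html: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Gold: schema.org JobPosting embedded as JSON-LD.
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "JobPosting":
            return data.get("description")

    # Silver: Avature-specific (BEM-style) selector.
    node = soup.select_one(".job-details__description")  # placeholder selector
    if node:
        return node.get_text(" ", strip=True)

    # Bronze: generic fallback classes.
    node = soup.select_one(".job-description, .content")
    if node:
        return node.get_text(" ", strip=True)
    return None
```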
To avoid blocking (429/403 errors) and ensure long-running execution:
- High Jitter: Random pauses of 2 to 5 seconds between requests.
- Low Concurrency: Limited to 2 simultaneous threads (simulating human navigation).
- Incremental Saving: Discovery and Enrichment stages save data line-by-line. If the process is interrupted, progress is preserved.
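A minimal asyncio/aiohttp sketch of these three measures together; the constant values mirror `src/config.py`, but the helper functions themselves are illustrative:

```python
# Illustrative sketch of the politeness pattern: low concurrency via a
# semaphore, randomized jitter between requests, and line-by-line saving.
import asyncio
import json
import random

import aiohttp

MAX_CONCURRENT_REQUESTS = 2
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)


async def polite_fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with semaphore:
        # High jitter: random pause of roughly 2-5 seconds before each request.
        await asyncio.sleep(random.uniform(2.0, 5.0))
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as resp:
            resp.raise_for_status()
            return await resp.text()


def save_incrementally(record: dict, path: str = "data/outputs/jobs_enriched.jsonl") -> None:
    # Append one JSON line per record so progress survives interruptions.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```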
The project was divided into 3 independent stages to facilitate scalability and debugging:
- Discovery (`run_discovery.py`)
  - Uses Multi-Source Intelligence (DuckDuckGo + Passive Sources like `crt.sh` and AlienVault) to find `*.avature.net` subdomains.
  - Maintains an incremental database in `data/outputs/discovered_urls.json`.
- Extraction (`run_extraction.py`)
  - Reads the discovered domains.
  - Performs reverse engineering of the `/SearchJobs` API (automatically detecting endpoint variations; see the sketch after this list).
  - Generates the raw dataset (`jobs.json`) with high pagination reliability.
- Enrichment (`run_enrichment.py`)
  - Visits each job URL individually.
  - Extracts full description, clean HTML, and normalized metadata.
  - Saves in JSONL format (`jobs_enriched.jsonl`), ideal for Big Data ingestion.
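A hedged sketch of the endpoint-detection idea used in the Extraction stage; the candidate paths below are assumptions for illustration, not a verified list of Avature URL patterns:

```python
# Illustrative sketch of probing a subdomain for a working SearchJobs endpoint.
# The candidate paths are assumptions, not an exhaustive or verified list.
import aiohttp

CANDIDATE_PATHS = (
    "/careers/SearchJobs",
    "/jobs/SearchJobs",
    "/en_US/careers/SearchJobs",
)


async def detect_endpoint(session: aiohttp.ClientSession, domain: str) -> str | None:
    for path in CANDIDATE_PATHS:
        url = f"https://{domain}{path}"
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as resp:
                if resp.status == 200:
                    return url  # first path that answers successfully wins
        except aiohttp.ClientError:
            continue
    return None
```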
- Python 3.10+
- Dependencies installed: `pip install -r requirements.txt`
To run all stages sequentially:

```bash
python main.py
```

Step A: Portal Discovery

```bash
python run_discovery.py
```

Output: `data/outputs/discovered_urls.json`

Step B: List Extraction

```bash
python run_extraction.py
```

Output: `data/outputs/jobs.json`

Step C: Data Enrichment

```bash
python run_enrichment.py
```

Output: `data/outputs/jobs_enriched.jsonl`
The `src/config.py` file controls the bot's behavior. The current default is Ultra-Conservative mode:

```python
REQUEST_TIMEOUT = 60          # High tolerance for slow servers
MAX_CONCURRENT_REQUESTS = 2   # Low concurrency (avoids detection)
RATE_LIMIT_DELAY = 3.0        # Base pause
MAX_RETRIES = 7               # Persistence in case of errors
```

To increase speed (in controlled environments), increase `MAX_CONCURRENT_REQUESTS` and decrease `RATE_LIMIT_DELAY`.
- aiohttp: High-performance asynchronous HTTP requests.
- BeautifulSoup4 (lxml): Robust and fault-tolerant HTML parsing.
- Pydantic: Data validation and strong typing.
- Loguru: Structured and readable logging.
- Ruff: Linter and formatter to ensure code quality (PEP 8).