Playwright-based scraping pipeline that finds Rive animation creators across Contra, Behance, and X (Twitter), then deduplicates, validates, and exports them to CSV.
Login (persistent browser profile)
                 │
         ┌───────┼───────┐
         ▼       ▼       ▼
      Contra  Behance    X
         │       │       │
         └───────┼───────┘
                 ▼
        Merge & Deduplicate
                 ▼
     Validate Links (HEAD/GET)
                 ▼
 Export → data/out/candidates.csv
Sources:
- Contra — scrolls the Rive people listing, visits each profile to extract name, email, location, and portfolio links
- Behance — searches five Rive-related queries, extracts creator info from project pages and ld+json metadata
- X — scrolls the `@rive_app` timeline, identifies tweets mentioning Rive or linking to `.riv` files
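The check for text that mentions Rive or links to a `.riv` file could look roughly like the sketch below. The function name `has_rive_signal` and both regular expressions are illustrative assumptions, not the code in `utils_text.py`.

```python
import re

# Hypothetical patterns: "rive" as a whole word, or a URL ending in .riv.
RIVE_WORD = re.compile(r"\brive\b", re.IGNORECASE)
RIV_LINK = re.compile(r"https?://\S+\.riv\b", re.IGNORECASE)


def has_rive_signal(text: str) -> bool:
    """Return True if the text mentions Rive or links to a .riv file."""
    return bool(RIVE_WORD.search(text) or RIV_LINK.search(text))
```

The word boundaries keep unrelated words such as "drive" or "river" from counting as a signal.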
Pipeline stages:
- Collect — source-specific scrapers gather candidate profiles with evidence of Rive expertise
- Deduplicate — merges candidates across sources using identity keys (email, social usernames, website domain, name+link)
- Validate — sends HTTP HEAD requests (GET fallback) to all profile links; keeps candidates with at least one 200 response
- Export — writes the final list to CSV, capped at the target count
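To make the Deduplicate stage concrete, here is a minimal sketch that merges candidates sharing any identity key, using a small union-find. The dict-based candidate shape, the field names, and the `identity_keys` helper are assumptions for illustration; the name+link key is omitted for brevity, and `enrich.py` may work differently.

```python
from urllib.parse import urlparse


def identity_keys(candidate: dict) -> set[str]:
    """Build comparable keys from the fields a candidate happens to have."""
    keys = set()
    if candidate.get("email"):
        keys.add("email:" + candidate["email"].strip().lower())
    if candidate.get("website"):
        url = candidate["website"]
        host = urlparse(url if "//" in url else "//" + url).netloc.lower()
        if host:
            keys.add("domain:" + host.removeprefix("www."))
    for field in ("instagram", "x"):
        if candidate.get(field):
            keys.add(f"{field}:" + candidate[field].strip().lstrip("@").lower())
    return keys


def deduplicate(candidates: list[dict]) -> list[dict]:
    """Merge candidates that are linked by at least one shared identity key."""
    parent = list(range(len(candidates)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen: dict[str, int] = {}
    for index, candidate in enumerate(candidates):
        for key in identity_keys(candidate):
            if key in first_seen:
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    merged: dict[int, dict] = {}
    for index, candidate in enumerate(candidates):
        target = merged.setdefault(find(index), {})
        for field, value in candidate.items():
            if value and not target.get(field):
                target[field] = value  # first non-empty value wins
    return list(merged.values())
```

Union-find handles transitive matches: a Contra record and a Behance record with nothing in common are still merged when a third record shares an email with one and a website domain with the other.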
rive_scout/
├── src/
│   ├── main.py          # CLI entry point (--login / --run)
│   ├── config.py        # URLs, CSV columns, timeouts
│   ├── browser.py       # Playwright context with anti-detection
│   ├── enrich.py        # Deduplication and candidate merging
│   ├── validate.py      # Link validation and filtering
│   ├── export_csv.py    # CSV export
│   ├── utils_http.py    # HTTP requests with retries
│   ├── utils_text.py    # Email/URL/name parsing, Rive signal detection
│   └── sources/
│       ├── contra.py    # Contra scraper
│       ├── behance.py   # Behance scraper
│       └── x_rive.py    # X/Twitter scraper
├── data/
│   ├── raw/             # Debug snapshots (JSON, HTML, PNG)
│   ├── cache/           # Reserved for future use
│   ├── out/             # Final CSV output
│   └── profile/         # Persistent Chromium session data
├── requirements.txt
└── .env.example
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m playwright install chromium
Copy `.env.example` to `.env` and adjust if needed:
HEADLESS=true
TARGET=50
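One way the two settings above could be read at startup, shown as a sketch using only the standard library. The `Settings` dataclass and the `load_settings` helper are hypothetical names, and the sketch assumes the variables are already present in the process environment (for example exported by the shell or loaded by a dotenv library).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    headless: bool
    target: int


def load_settings(env=os.environ) -> Settings:
    """Parse HEADLESS and TARGET, falling back to true and 50 when unset."""
    headless = env.get("HEADLESS", "true").strip().lower() in {"1", "true", "yes"}
    target = int(env.get("TARGET", "50"))
    return Settings(headless=headless, target=target)
```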
python -m src.main --login
Opens a headed browser with tabs for X, Instagram, Contra, and Behance. Log in manually, then press Enter in the terminal. Sessions persist in `data/profile/`.
python -m src.main --run --target 50 --headless true

| Flag | Description | Default |
|---|---|---|
| `--run` | Execute the scraping pipeline | — |
| `--login` | Open browser for manual login | — |
| `--target N` | Target candidate count | 50 |
| `--headless true/false` | Run browser headlessly | true |
| `--sources contra,behance,x` | Select sources to scrape | all three |
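These flags map naturally onto `argparse`. The parser below is a sketch of how such a CLI could be declared, not the contents of `main.py`; the helper names `parse_bool`, `parse_sources`, and `build_parser` are assumptions.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Accept the true/false strings used by --headless."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def parse_sources(value: str) -> list[str]:
    """Split a comma-separated source list such as 'contra,behance'."""
    return [name.strip().lower() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m src.main")
    parser.add_argument("--run", action="store_true", help="execute the scraping pipeline")
    parser.add_argument("--login", action="store_true", help="open browser for manual login")
    parser.add_argument("--target", type=int, default=50, help="target candidate count")
    parser.add_argument("--headless", type=parse_bool, default=True, help="run browser headlessly")
    parser.add_argument("--sources", type=parse_sources, default=["contra", "behance", "x"],
                        help="comma-separated sources to scrape")
    return parser
```

With this sketch, `build_parser().parse_args(["--run", "--sources", "contra,x"])` yields `run=True` and `sources=["contra", "x"]`, while the other options keep their defaults.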
Results are written to data/out/candidates.csv with these columns:
| Column | Description |
|---|---|
| Full name | Candidate name |
| Email address | Extracted from profile page |
| Instagram profile | Instagram URL if found |
| Website | Personal/portfolio website |
| Platform portfolio | Contra/Behance profile URL |
| Best work | Notable project URL (usually Behance) |
| Why impressive | Left blank unless source evidence exists |
| Country | Parsed from location string |
| Source | Contra, Behance, or X |
| Notes | Rive signals, availability mentions |
| Primary profile link | Main URL used for validation |
| Primary link status | HTTP status code |
| Evidence links | All discovered URLs |
| Validation notes | HTTP validation details |
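As an illustration of the Export stage, the sketch below writes rows with the columns listed above using `csv.DictWriter` and caps the output at the target count. The `export_candidates` function and the assumption that candidates are dicts keyed by column name are hypothetical; `export_csv.py` may differ.

```python
import csv
import io

CSV_COLUMNS = [
    "Full name", "Email address", "Instagram profile", "Website",
    "Platform portfolio", "Best work", "Why impressive", "Country",
    "Source", "Notes", "Primary profile link", "Primary link status",
    "Evidence links", "Validation notes",
]


def export_candidates(candidates, stream, target=50):
    """Write at most `target` candidates to `stream`; return rows written."""
    writer = csv.DictWriter(stream, fieldnames=CSV_COLUMNS, restval="", extrasaction="ignore")
    writer.writeheader()
    rows = candidates[:target]
    writer.writerows(rows)
    return len(rows)


# Usage: write to an in-memory buffer instead of data/out/candidates.csv.
buffer = io.StringIO()
written = export_candidates(
    [{"Full name": "Jane Doe", "Source": "Contra"}, {"Full name": "Sam Lee", "Source": "X"}],
    buffer,
    target=1,
)
```

Missing columns are written as empty strings through `restval`, which matches the idea of leaving fields such as "Why impressive" blank when there is no evidence.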
- No fabrication — every candidate has at least one real collected profile link
- Individuals only — agencies, studios, and collectives are filtered out
- Rive signal required — candidates must show explicit evidence of Rive knowledge
- Social link tolerance — Instagram/X may return 403/429 due to rate limiting; candidates are kept if another link validates as 200
- Anti-detection — persistent browser profile, randomized pauses, webdriver flag removal, custom User-Agent
- Debug artifacts — each pipeline stage saves JSON snapshots to `data/raw/` for troubleshooting
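The validation behaviour described above (HEAD first, GET as a fallback, and tolerance for 403/429 on social links as long as another link returns 200) can be sketched with the standard library. `check_link`, `keep_candidate`, and the timeout value are hypothetical, and the retries mentioned for `utils_http.py` are omitted here.

```python
import urllib.error
import urllib.request


def check_link(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, trying HEAD and then GET; 0 on failure."""
    status = 0
    for method in ("HEAD", "GET"):
        try:
            request = urllib.request.Request(url, method=method)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.status
        except urllib.error.HTTPError as error:
            status = error.code  # e.g. 403, 405, 429: fall through to GET
        except (OSError, ValueError):
            status = 0  # network error, timeout, or malformed URL
    return status


def keep_candidate(links: list[str], fetch_status=check_link) -> bool:
    """Keep a candidate when at least one of its links answers with 200."""
    return any(fetch_status(link) == 200 for link in links)
```

Passing `fetch_status` as a parameter is a design choice for this sketch: it lets the keep/drop rule be exercised offline with a stub instead of real requests.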