cspshadx/rive-scout

Rive Scout

Playwright-based scraping pipeline that finds Rive animation creators across Contra, Behance, and X (Twitter), then deduplicates, validates, and exports them to CSV.

How It Works

Login (persistent browser profile)
            │
    ┌───────┼───────┐
    ▼       ▼       ▼
  Contra  Behance    X
    │       │       │
    └───────┼───────┘
            ▼
   Merge & Deduplicate
            ▼
   Validate Links (HEAD/GET)
            ▼
   Export → data/out/candidates.csv

Sources:

  • Contra — scrolls the Rive people listing, visits each profile to extract name, email, location, and portfolio links
  • Behance — searches five Rive-related queries, extracts creator info from project pages and ld+json metadata
  • X — scrolls the @rive_app timeline, identifies tweets mentioning Rive or linking to .riv files
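The Rive signal check described for the X source could look roughly like the sketch below. The regular expressions, the signal labels, and the function name `rive_signals` are assumptions for illustration; the actual rules in utils_text.py are not shown in this README.

```python
import re

# Hypothetical patterns; the project's real detection rules may differ.
RIVE_WORD = re.compile(r"\brive\b", re.IGNORECASE)          # the word "Rive" on its own
RIV_FILE = re.compile(r"https?://\S+\.riv\b", re.IGNORECASE)  # a URL ending in .riv


def rive_signals(text: str) -> list[str]:
    """Return a list of human-readable Rive signals found in a piece of text."""
    signals = []
    if RIVE_WORD.search(text):
        signals.append("mentions Rive")
    if RIV_FILE.search(text):
        signals.append("links .riv file")
    return signals
```

The word-boundary match keeps words such as "driver" or "arrive" from counting as a Rive mention.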

Pipeline stages:

  1. Collect — source-specific scrapers gather candidate profiles with evidence of Rive expertise
  2. Deduplicate — merges candidates across sources using identity keys (email, social usernames, website domain, name+link)
  3. Validate — sends HTTP HEAD requests (GET fallback) to all profile links; keeps candidates with at least one 200 response
  4. Export — writes the final list to CSV, capped at the target count
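The deduplication stage can be pictured as grouping candidates that share any identity key. The following is a minimal sketch under stated assumptions: the field names (`email`, `website`, `instagram`, `x`, `name`, `profile_url`), the key prefixes, and the "earlier record wins" merge rule are all hypothetical and may not match enrich.py.

```python
from urllib.parse import urlparse


def identity_keys(cand: dict) -> set[str]:
    """Build normalized identity keys from hypothetical candidate fields."""
    keys = set()
    if cand.get("email"):
        keys.add("email:" + cand["email"].strip().lower())
    if cand.get("website"):
        host = urlparse(cand["website"]).netloc.lower().removeprefix("www.")
        if host:
            keys.add("domain:" + host)
    for platform in ("instagram", "x"):
        if cand.get(platform):
            keys.add(f"{platform}:" + cand[platform].strip().lstrip("@").lower())
    if cand.get("name") and cand.get("profile_url"):
        keys.add("namelink:" + cand["name"].strip().lower() + "|" + cand["profile_url"])
    return keys


def merge_records(base: dict, extra: dict) -> dict:
    """Fill empty fields in base from extra; base wins on conflicts."""
    out = dict(base)
    for field, value in extra.items():
        if value and not out.get(field):
            out[field] = value
    return out


def deduplicate(candidates: list[dict]) -> list[dict]:
    """Merge candidates that share at least one identity key."""
    groups: list[tuple[set[str], dict]] = []
    for cand in candidates:
        keys = identity_keys(cand)
        record = dict(cand)
        kept = []
        for group_keys, group_record in groups:
            if keys & group_keys:
                keys = keys | group_keys
                record = merge_records(group_record, record)
            else:
                kept.append((group_keys, group_record))
        kept.append((keys, record))
        groups = kept
    return [record for _, record in groups]
```

Because a merged group carries the union of its members' keys, a Contra record matched by email and a Behance record matched by website domain can end up in the same group even when they share no key directly.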

Project Layout

rive_scout/
├── src/
│   ├── main.py           # CLI entry point (--login / --run)
│   ├── config.py         # URLs, CSV columns, timeouts
│   ├── browser.py        # Playwright context with anti-detection
│   ├── enrich.py         # Deduplication and candidate merging
│   ├── validate.py       # Link validation and filtering
│   ├── export_csv.py     # CSV export
│   ├── utils_http.py     # HTTP requests with retries
│   ├── utils_text.py     # Email/URL/name parsing, Rive signal detection
│   └── sources/
│       ├── contra.py     # Contra scraper
│       ├── behance.py    # Behance scraper
│       └── x_rive.py     # X/Twitter scraper
├── data/
│   ├── raw/              # Debug snapshots (JSON, HTML, PNG)
│   ├── cache/            # Reserved for future use
│   ├── out/              # Final CSV output
│   └── profile/          # Persistent Chromium session data
├── requirements.txt
└── .env.example

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m playwright install chromium

Copy .env.example to .env and adjust if needed:

HEADLESS=true
TARGET=50
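As a rough illustration of how these two settings might be read, the sketch below parses them from an environment mapping. The function name `load_settings`, the accepted truthy strings, and the fallback defaults are assumptions; how config.py actually loads .env is not documented here.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read HEADLESS and TARGET, falling back to the values shown above."""
    headless = env.get("HEADLESS", "true").strip().lower() in {"1", "true", "yes"}
    target = int(env.get("TARGET", "50"))
    return {"headless": headless, "target": target}
```

Passing the mapping as a parameter makes the parsing easy to exercise without touching the real process environment.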

Usage

Login (one-time)

python -m src.main --login

Opens a headed browser with tabs for X, Instagram, Contra, and Behance. Log in manually, then press Enter in the terminal. Sessions persist in data/profile/.
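Session persistence of this kind is commonly built on Playwright's `launch_persistent_context`, which stores cookies and local storage in a user data directory. The sketch below is one possible shape, not the project's browser.py: the function names, the User-Agent string, the init script, and the pause bounds are assumptions.

```python
import random
import time

PROFILE_DIR = "data/profile"  # session directory named in this README

# Hypothetical User-Agent; the value the project uses is not documented here.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Script run before page scripts to hide the navigator.webdriver flag.
HIDE_WEBDRIVER = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"


def open_context(playwright, headless: bool = True):
    """Open a Chromium context whose session data persists in PROFILE_DIR."""
    context = playwright.chromium.launch_persistent_context(
        PROFILE_DIR, headless=headless, user_agent=USER_AGENT
    )
    context.add_init_script(HIDE_WEBDRIVER)
    return context


def human_pause(low: float = 0.8, high: float = 2.5, sleep=time.sleep) -> float:
    """Sleep for a random interval between actions and return the delay used."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Usage (requires the playwright package and an installed Chromium):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     ctx = open_context(p, headless=False)
#     ctx.new_page().goto("https://contra.com")
#     input("Log in, then press Enter...")
#     ctx.close()
```

Reusing the same directory on later runs is what lets the headless pipeline start already logged in.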

Run the pipeline

python -m src.main --run --target 50 --headless true

Flag                         Description                     Default
--run                        Execute the scraping pipeline
--login                      Open browser for manual login
--target N                   Target candidate count          50
--headless true/false        Run browser headlessly          true
--sources contra,behance,x   Select sources to scrape        all three
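For readers who want to see how flags like these fit together, here is a minimal argparse sketch. It is not the contents of main.py: treating `--login` and `--run` as mutually exclusive, the helper names, and the accepted boolean spellings are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse the true/false strings accepted by --headless."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def parse_sources(value: str) -> list[str]:
    """Split a comma-separated source list such as 'contra,behance,x'."""
    return [part.strip().lower() for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="src.main")
    mode = parser.add_mutually_exclusive_group(required=True)  # assumption
    mode.add_argument("--login", action="store_true", help="open browser for manual login")
    mode.add_argument("--run", action="store_true", help="execute the scraping pipeline")
    parser.add_argument("--target", type=int, default=50, help="target candidate count")
    parser.add_argument("--headless", type=str_to_bool, default=True)
    parser.add_argument("--sources", type=parse_sources, default=["contra", "behance", "x"])
    return parser
```

With this parser, `--run --sources contra,x` would yield `args.sources == ["contra", "x"]`, and omitting `--sources` selects all three.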

Output

Results are written to data/out/candidates.csv with these columns:

Column                 Description
Full name              Candidate name
Email address          Extracted from profile page
Instagram profile      Instagram URL if found
Website                Personal/portfolio website
Platform portfolio     Contra/Behance profile URL
Best work              Notable project URL (usually Behance)
Why impressive         Left blank unless source evidence exists
Country                Parsed from location string
Source                 Contra, Behance, or X
Notes                  Rive signals, availability mentions
Primary profile link   Main URL used for validation
Primary link status    HTTP status code
Evidence links         All discovered URLs
Validation notes       HTTP validation details
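Writing rows with this column order and capping them at the target count can be sketched with the standard csv module. The column names come from the list above; the function name `export_csv` and its signature are assumptions and may differ from export_csv.py.

```python
import csv
from typing import TextIO

CSV_COLUMNS = [
    "Full name", "Email address", "Instagram profile", "Website",
    "Platform portfolio", "Best work", "Why impressive", "Country",
    "Source", "Notes", "Primary profile link", "Primary link status",
    "Evidence links", "Validation notes",
]


def export_csv(candidates: list[dict], out: TextIO, target: int) -> int:
    """Write at most `target` candidates to `out`; return the number of rows written."""
    writer = csv.DictWriter(
        out,
        fieldnames=CSV_COLUMNS,
        restval="",              # missing columns become empty cells
        extrasaction="ignore",   # unknown keys are dropped rather than raising
    )
    writer.writeheader()
    rows = candidates[:target]
    writer.writerows(rows)
    return len(rows)
```

When writing to disk, the file would typically be opened with `open("data/out/candidates.csv", "w", newline="", encoding="utf-8")` so that the csv module controls line endings.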

Design Notes

  • No fabrication — every candidate has at least one real collected profile link
  • Individuals only — agencies, studios, and collectives are filtered out
  • Rive signal required — candidates must show explicit evidence of Rive knowledge
  • Social link tolerance — Instagram/X may return 403/429 due to rate limiting; candidates are kept if another link validates as 200
  • Anti-detection — persistent browser profile, randomized pauses, webdriver flag removal, custom User-Agent
  • Debug artifacts — each pipeline stage saves JSON snapshots to data/raw/ for troubleshooting
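The HEAD-then-GET validation and the social link tolerance rule can be sketched with the standard library alone. This is an illustration under assumptions: the function names, the timeout, and the User-Agent header are hypothetical, and the project's utils_http.py also applies retries, which are omitted here.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def fetch_status(url: str, method: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for one request, or None on a network-level failure."""
    request = urllib.request.Request(
        url, method=method, headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses still carry a status
        return exc.code
    except (OSError, ValueError):          # DNS errors, timeouts, malformed URLs
        return None


def check_link(url: str, fetch: Callable[..., Optional[int]] = fetch_status) -> Optional[int]:
    """Try HEAD first; fall back to GET when HEAD does not return 200."""
    status = fetch(url, "HEAD")
    if status != 200:
        status = fetch(url, "GET")
    return status


def keep_candidate(links: Iterable[str], fetch: Callable[..., Optional[int]] = fetch_status) -> bool:
    """Keep a candidate if at least one of its links validates as 200."""
    return any(check_link(url, fetch) == 200 for url in links)
```

Under this rule a candidate whose Instagram link returns 429 is still kept as long as, say, their Behance profile returns 200.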
