Reliable TypeScript crawler for fresher and early-career jobs using API + RSS + HTTP + Playwright with deduplication, PostgreSQL persistence, health checks, and alerting.
Most scrapers depend too heavily on one source. This project uses a tiered strategy so collection continues even when one source underperforms.
- Starts with low-cost, stable sources first
- Escalates to headless crawling only when needed
- Deduplicates continuously across runs
- Persists clean records into PostgreSQL
- Monitors itself with metrics + health reporting
Tier 0 (pre-source)
- Himalayas RSS (runs before the orchestrator)

Orchestrator tiers
- Serper API
- Jobicy RSS
- Indeed RSS
- Internshala HTTP
- Naukri API/HTTP

Headless tier (Playwright, conditional unless a paid proxy is configured)
- Cutshort
- Foundit
- Shine
- TimesJobs
- Wellfound
- Optional: Indeed, LinkedIn

Headless tier rules:
- Always runs if a paid/residential proxy is detected
- Otherwise runs only if collected jobs fall below `MIN_JOBS_BEFORE_HEADLESS` (default 15), as sketched below
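A minimal sketch of that gate, with illustrative helper and variable handling (the real decision lives in the orchestrator):

```ts
// Illustrative only: decides whether the headless tier should run this cycle.
function shouldRunHeadlessTier(jobsCollected: number, env = process.env): boolean {
  const minJobs = Number(env.MIN_JOBS_BEFORE_HEADLESS ?? 15);
  // Assumption for this sketch: a proxy URL with credentials is treated as paid/residential.
  const hasPaidProxy = (env.PROXY_URLS ?? "")
    .split(",")
    .some((url) => url.includes("@"));
  return hasPaidProxy || jobsCollected < minJobs;
}
```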
| Source | Mode | Default | Notes |
|---|---|---|---|
| Himalayas | RSS | Enabled | Pre-source run before orchestrator |
| Serper | API | Enabled | Requires SERPER_API_KEY |
| Jobicy | RSS | Enabled | Low-cost supplement |
| Indeed | RSS | Enabled | Lightweight feed ingestion |
| Internshala | HTTP + Cheerio | Enabled | Internship-oriented source |
| Naukri | JSON API + HTML fallback | Enabled | API first, HTML fallback |
| Cutshort | Playwright | Enabled in headless tier | Runs when headless is active |
| Foundit | Playwright | Enabled in headless tier | Runs when headless is active |
| Shine | Playwright | Enabled in headless tier | Runs when headless is active |
| TimesJobs | Playwright | Enabled in headless tier | Runs when headless is active |
| Wellfound | Playwright | Enabled in headless tier | Runs when headless is active |
| Indeed (headless) | Playwright | Disabled | Enable with ENABLE_INDEED=true |
| LinkedIn (headless) | Playwright | Disabled | Enable with ENABLE_LINKEDIN=true + LINKEDIN_COOKIE |
- Node.js 20+
- TypeScript
- Crawlee 3 + Playwright
- PostgreSQL (`pg`)
- Zod for env/schema validation
Install dependencies and the Playwright Chromium browser:

```bash
npm install
npx playwright install --with-deps chromium
```

Copy the environment template:

```bash
cp .env.example .env
```

Minimum recommended values:

```
# Database (either DATABASE_URL or PG* variables)
DATABASE_URL=postgresql://postgres:password@localhost:5432/attack

# Required API key
SERPER_API_KEY=your_real_key

# Optional proxies (comma-separated)
PROXY_URLS=http://user:pass@host:port
```

Create the database and user:

```bash
sudo -u postgres psql <<'SQL'
CREATE USER crawler WITH PASSWORD 'change_me';
CREATE DATABASE attack OWNER crawler;
GRANT ALL PRIVILEGES ON DATABASE attack TO crawler;
SQL
```

Then apply migrations:

```bash
npm run db:migrate
```

Run the crawler:

```bash
npm start
```

Verbose mode:

```bash
npm run start:verbose
```

Crawl-Job uses AI to extract structured job data from scraped HTML pages.
It supports any AI provider via the bundled model-select tool.
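As a rough illustration of that flow (not the project's actual extraction code): send the scraped HTML to whatever model is configured and ask for the job fields back as JSON. The endpoint and env variable names below are assumptions for an OpenAI-compatible provider.

```ts
// Illustrative sketch: ask an OpenAI-compatible endpoint for structured job fields.
async function extractJobFields(html: string): Promise<Record<string, string>> {
  const res = await fetch(`${process.env.AI_BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.AI_API_KEY ?? ""}`,
    },
    body: JSON.stringify({
      model: process.env.MODEL_ID,
      messages: [
        { role: "system", content: "Return JSON with title, company, location, salary." },
        { role: "user", content: html.slice(0, 20000) }, // keep the prompt bounded
      ],
    }),
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content); // the model is asked to emit raw JSON
}
```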
```bash
npm run setup
```

This launches an interactive CLI where you:
- Select your AI provider (Anthropic, OpenAI, Gemini, Ollama, Groq, and 15+ more)
- Enter your API key (validated live before saving)
- Select a model
- Config is saved to `.env.modelselect` automatically
On the next `npm start`, Crawl-Job reads `.env.modelselect` and uses your chosen provider.
Alternatively, create `.env.modelselect` in the project root by hand:
```
MODEL_ID=anthropic/claude-sonnet-4-5
ANTHROPIC_API_KEY=sk-ant-your-key-here
```

| Provider | Example Model ID | Needs API Key |
|---|---|---|
| Ollama (local) | ollama/qwen2.5 | No |
| LM Studio (local) | lmstudio/llama3.3 | No |
| Anthropic | anthropic/claude-sonnet-4-5 | Yes |
| OpenAI | openai/gpt-5.1-codex | Yes |
| Google Gemini | google/gemini-3-pro-preview | Yes |
| Groq | groq/llama-3.3-70b-versatile | Yes |
| OpenRouter | openrouter/anthropic/claude-sonnet-4-5 | Yes |
| Mistral | mistral/mistral-large-latest | Yes |
| xAI / Grok | xai/grok-3 | Yes |
| Z.AI / GLM | zai/glm-5 | Yes |
| + 9 more | Run npm run setup:list | — |
If no .env.modelselect exists, Crawl-Job falls back to legacy OLLAMA_* env vars.
Existing .env configurations continue to work with zero changes.
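A sketch of that fallback order, using hypothetical helper code (file names as above):

```ts
import fs from "node:fs";
import dotenv from "dotenv";

// Illustrative: prefer .env.modelselect when present; otherwise the legacy vars in .env apply.
export function loadModelConfig(): void {
  if (fs.existsSync(".env.modelselect")) {
    dotenv.config({ path: ".env.modelselect" }); // written by `npm run setup`
  }
  // With no MODEL_ID set, the crawler keeps using the legacy OLLAMA_* variables.
}
```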
Development:
- `npm run dev` - watch mode for `src/main.ts`
- `npm start` - run the crawler with `tsx`
- `npm run start:verbose` - run the crawler with verbose logs
- `npm run build` - compile TypeScript to `dist/`
- `npm run start:prod` - run the compiled `dist/main.js`
- `npm run typecheck` - strict type check without emitting files

Database:
- `npm run db:migrate` - create/alter the `jobs` table and indexes
- `npm run db:migrate:check` - test DB connectivity only

Data and maintenance:
- `npm run export` - export the Crawlee dataset to `jobs_export.csv`
- `npm run archive` - archive old dataset shards
- `npm run archive:preview` - dry-run archive
- `npm run archive:status` - storage usage + archive inventory
- `npm run upload` - upload pending archives to S3-compatible storage
- `npm run cleanup` - delete old archives based on retention
- `npm run cleanup:preview` - dry-run cleanup
- `npm run maintenance` - archive -> upload -> cleanup
- `npm run maintenance:dry` - dry-run the full maintenance cycle

Testing:
- `npm test` - build + run all specs
- `npm run test:verbose` - include stack traces
- `npm run test:fast` - skip build, run specs directly from `dist/`
- `npm run test:watch` - rerun tests on file changes
| Path | Purpose |
|---|---|
| log.txt | Run log (truncated at startup, rotates when large) |
| storage/dedup-store.json | Persistent dedup fingerprints |
| storage/metrics-snapshot.json | Periodic metrics snapshot |
| storage/health-report.json | Health status report |
| storage/datasets/ | Crawlee dataset storage |
| jobs_export.csv | CSV export from the dataset |
The migration creates a jobs table with ingestion + dedup fields:
- Identity: `url`, `title`, `company`, `platform`, `platform_job_id`, `apply_url`
- Content: `description`, `location`, `salary`, `job_type`, `experience`, `seniority`, `posted_date`
- Crawl metadata: `source`, `source_tier`, `scraped_at`
- Dedup key: `fingerprint` (UNIQUE)
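One plausible way to derive a fingerprint like this (the project's actual normalization rules may differ):

```ts
import { createHash } from "node:crypto";

// Illustrative dedup fingerprint: a hash of normalized identity fields.
function jobFingerprint(job: { url: string; title: string; company: string }): string {
  const normalized = [job.url, job.title, job.company]
    .map((field) => field.trim().toLowerCase())
    .join("|");
  return createHash("sha256").update(normalized).digest("hex");
}
```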
Indexes include source/platform/date lookups and a GIN FTS index over title + company.
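For example, that full-text index can be queried from `pg` along these lines (the exact index expression in the migration may differ, so treat this as a sketch):

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Illustrative full-text lookup over title + company.
async function searchJobs(term: string) {
  const { rows } = await pool.query(
    `SELECT url, title, company
       FROM jobs
      WHERE to_tsvector('english', title || ' ' || company) @@ plainto_tsquery('english', $1)
      LIMIT 20`,
    [term]
  );
  return rows;
}
```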
See .env.example for the full list. Start with:
- `DATABASE_URL` or `PGHOST` / `PGPORT` / `PGUSER` / `PGPASSWORD` / `PGDATABASE`
- `SERPER_API_KEY`
- `PROXY_URLS`, `PROXY_MIN_COUNT`, `PROXY_REFRESH_INTERVAL_MINUTES`
- `MIN_JOBS_BEFORE_HEADLESS`, `HEADLESS_MAX_CONCURRENCY`
- `ENABLE_INDEED`, `ENABLE_LINKEDIN`, `LINKEDIN_COOKIE`
- `ENABLE_DOMAIN_RATE_LIMITING`
- `DEDUP_RETENTION_DAYS`, `ARCHIVE_AFTER_DAYS`, `DELETE_AFTER_DAYS`
- `ENABLE_ALERTS`, `ALERT_SLACK_WEBHOOK`, `ALERT_WEBHOOK_URL`, `ALERT_EMAIL`
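Since the project validates its environment with Zod, a minimal sketch of such a schema might look like this (field list and defaults are illustrative, not the project's actual schema):

```ts
import { z } from "zod";

// Illustrative env schema covering a few of the variables listed above.
const envSchema = z.object({
  DATABASE_URL: z.string().min(1).optional(),
  SERPER_API_KEY: z.string().min(1),
  PROXY_URLS: z.string().optional(),
  MIN_JOBS_BEFORE_HEADLESS: z.coerce.number().int().positive().default(15),
});

export const env = envSchema.parse(process.env); // throws early on missing/invalid values
```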
The crawler tracks:
- request success/failure
- requests per minute
- response-time average
- rate-limit hits and proxy failures
- memory usage and no-progress windows
Health levels:
healthy -> degraded -> critical
When alerts are enabled, notifications are sent with a cooldown controlled by `ALERT_COOLDOWN_MIN`.
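A rough sketch of how a level might be derived from those metrics (thresholds here are illustrative, not the project's actual values):

```ts
type HealthLevel = "healthy" | "degraded" | "critical";

// Illustrative mapping from tracked metrics to a health level.
function healthLevel(m: {
  failureRate: number;           // failed requests / total requests
  rateLimitHits: number;         // 429s seen in the current window
  minutesWithoutProgress: number;
}): HealthLevel {
  if (m.failureRate > 0.5 || m.minutesWithoutProgress > 30) return "critical";
  if (m.failureRate > 0.2 || m.rateLimitHits > 10) return "degraded";
  return "healthy";
}
```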
- Add `src/sources/<name>.ts` returning a `SourceResult` (see the skeleton below)
- Wire it into `src/orchestrator.ts`
- Ensure the required fields exist: `url`, `title`, `company`, `description`
- Run `npm test` + a local crawl
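A skeleton of such a module, assuming `SourceResult` carries at least the required fields above (the real type is defined in the project; the endpoint below is a placeholder):

```ts
// Illustrative skeleton for src/sources/<name>.ts; SourceResult here is a local stand-in
// for the project's real type.
interface SourceResult {
  url: string;
  title: string;
  company: string;
  description: string;
}

export async function fetchExampleJobs(): Promise<SourceResult[]> {
  const res = await fetch("https://example.com/api/jobs"); // placeholder endpoint
  const items = (await res.json()) as Array<Record<string, string>>;
  return items.map((item) => ({
    url: item.url,
    title: item.title,
    company: item.company,
    description: item.description ?? "",
  }));
}
```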
- Add selectors in `src/config/<site>.ts` (see the sketch after this list)
- Add an extractor in `src/extractors/<site>.ts`
- Register handlers in `src/routes.ts`
- Seed URLs in `src/main.ts`
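The selectors file can be as small as an exported map of CSS selectors; the names and selectors below are purely illustrative:

```ts
// Illustrative src/config/<site>.ts: selectors the extractor and route handlers can share.
export const exampleSiteSelectors = {
  jobCard: ".job-card",
  title: ".job-card h2",
  company: ".job-card .company-name",
  location: ".job-card .location",
  nextPage: "a[rel='next']",
} as const;
```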
- `scripts/setup-server.sh` - full-machine bootstrap helper
- `deploy/setup.sh` - one-time service setup helper
- `deploy/deploy.sh` - update/rebuild/restart flow
- `deploy/job-crawler.service` - systemd unit template
- `deploy/env.production` - environment template
Before running deployment scripts, review usernames, paths, service names, and environment values.
- DB check fails -> verify `DATABASE_URL` / PG* vars, then run `npm run db:migrate:check`
- Very low job count -> verify `SERPER_API_KEY`, inspect `storage/health-report.json`
- Frequent 403/429 -> reduce concurrency, increase delays, use better proxies
- LinkedIn returns 0 -> set `ENABLE_LINKEDIN=true` and a valid `LINKEDIN_COOKIE`
- Playwright fails -> reinstall with `npx playwright install --with-deps chromium`
- Respect target site terms and robots policies where applicable
- Keep request rates conservative
- Avoid aggressive parallelism on protected domains
- Never use personal accounts for automated scraping on risky platforms
The package metadata declares the ISC license.