Unified web scraper supporting Chrome (Playwright-managed Chromium) and Obscura browser engines from a single codebase.
scraper/
├── config.js                      # Unified configuration
├── main.js                        # Entry point (CLI)
├── src/
│   ├── cli/
│   │   └── app.js                 # Commander CLI
│   ├── core/
│   │   ├── engine.js              # Browser factory
│   │   ├── generate_report.js     # Chrome vs Obscura report HTML generator
│   │   ├── obscura.js             # Obscura binary lifecycle
│   │   ├── renderer.js            # Unified page renderer
│   │   ├── parser.js              # HTML parser (script/style stripping)
│   │   ├── scraper.js             # Orchestrator (sequential)
│   │   ├── storage.js             # File persistence + report writer
│   │   ├── stats.js               # Performance statistics
│   │   └── csv_reader.js          # CSV URL extractor
│   └── utils/
│       └── logger.js              # Structured logger
└── output/
    ├── chrome/                    # Chrome engine results
    │   └── <category>/
    │       ├── <urlId>/
    │       │   ├── raw.html
    │       │   ├── parsed.html
    │       │   ├── event_stats.json
    │       │   ├── dom_metrics.json
    │       │   └── network.har
    │       ├── report.json
    │       └── resource_samples.json
    └── obscura/                   # Obscura engine results
        └── <category>/
            └── ...
npm install

Optional: install or point to Chrome and Obscura binaries via environment variables (see below).
# Chrome (default)
ENGINE=chrome node main.js single https://example.com
# Obscura
ENGINE=obscura node main.js single https://example.com

# Scrape a single URL
node main.js single <url>
# Scrape all URLs from a CSV file
node main.js csv <path/to/urls.csv>
# Generate a comparison report (HTML)
node main.js report \
--chrome-report <path/to/chrome/report.json> \
--chrome-samples <path/to/chrome/resource_samples.json> \
--obscura-report <path/to/obscura/report.json> \
--obscura-samples <path/to/obscura/resource_samples.json> \
--out <path/to/comparison.html>

npm run start
npm run single -- https://example.com
npm run csv -- ./data/Test.csv

CSV format: any file with a `url` column header, or a single column of URLs. Non-URL cells are ignored; duplicates are removed.
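The CSV rules above can be sketched as follows. This is a minimal illustration, not the code in `src/core/csv_reader.js`: the function name `extractUrls`, the naive comma splitting (no support for commas inside quoted cells), and the restriction to `http`/`https` URLs are all assumptions.

```javascript
// Hypothetical helper; the real csv_reader.js may parse and validate differently.
function extractUrls(csvText) {
  const rows = csvText
    .split(/\r?\n/)
    .map((line) => line.split(',').map((cell) => cell.trim().replace(/^"|"$/g, '')))
    .filter((cells) => cells.some((cell) => cell !== ''));
  if (rows.length === 0) return [];

  // Use the "url" column when the header names one; otherwise scan every cell.
  const urlColumn = rows[0].findIndex((cell) => cell.toLowerCase() === 'url');
  const candidates = urlColumn >= 0
    ? rows.slice(1).map((cells) => cells[urlColumn] ?? '')
    : rows.flat();

  // Assumption: only absolute http(s) URLs count as URLs.
  const isHttpUrl = (value) => {
    try {
      const { protocol } = new URL(value);
      return protocol === 'http:' || protocol === 'https:';
    } catch {
      return false;
    }
  };

  // A Set keeps insertion order, so duplicates are dropped without reordering.
  return [...new Set(candidates.filter(isHttpUrl))];
}
```

With a header row, only the `url` column is read; without one, every cell is checked, which covers the single-column case.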
- Each run writes to `output/<engine>/<category>/`, where category is `single` or the CSV filename (without extension).
- The category directory is cleared before each run.
- URL folders are named by an 8-character MD5 hash of the URL.
- `report.json` includes per-URL results and aggregate stats for the run.
- `resource_samples.json` stores periodic CPU/RAM samples for Node and (when available) the browser process tree.
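As a rough illustration of the folder naming, the sketch below derives a per-URL directory from the engine, category, and URL. It assumes the 8 characters are the first 8 hex digits of the MD5 digest; `urlId` and `urlOutputDir` are hypothetical names, and `storage.js` may build paths differently.

```javascript
const crypto = require('node:crypto');
const path = require('node:path');

// Assumption: the folder name is the first 8 hex characters of the MD5 digest.
function urlId(url) {
  return crypto.createHash('md5').update(url).digest('hex').slice(0, 8);
}

// Builds output/<engine>/<category>/<urlId> relative to the project root.
function urlOutputDir(engine, category, url) {
  return path.join('output', engine, category, urlId(url));
}

console.log(urlOutputDir('chrome', 'single', 'https://example.com'));
```

Because the ID depends only on the URL, the same URL maps to the same folder name under both `output/chrome/` and `output/obscura/`, which makes per-URL results easy to pair up.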
The report command builds an HTML comparison using Chrome and Obscura reports plus their resource samples. If any flag is omitted, defaults come from config.js (see REPORT_PATHS and DEFAULT_REPORT_CATEGORY).
| Variable | Default | Description |
|---|---|---|
| `ENGINE` | `chrome` | Browser engine: `chrome` or `obscura` |
| `TIMEOUT_MS` | `30000` | Timeout for browser operations (context, page, content) |
| `WAIT_UNTIL` | `domcontentloaded` | Playwright `waitUntil` strategy |
| `WAIT_AFTER_LOAD_MS` | `2000` | Wait after page load |
| `SETTLE_MS` | `500` | Extra wait before capturing content (fixed in config) |
| `MAX_RETRIES` | `2` | Retry attempts per URL |
| `MIN_CONTENT_LENGTH` | `100` | Minimum HTML length for successful capture |
| `ENABLE_SCROLL` | `true` | Scroll page after load |
| `SCROLL_STEPS` | `8` | PageDown steps during scroll |
| `SCROLL_STEP_DELAY_MS` | `200` | Delay between scroll steps |
| `ENABLE_EVENT_SIM` | `true` | Simulate user interactions |
| `EVENT_MOUSE_MOVES` | `5` | Mouse move events to simulate |
| `EVENT_CLICK_COUNT` | `3` | Click interactions per selector bucket |
| `EVENT_KEYSTROKES` | `8` | Keystrokes to type in the first input |
| `EVENT_SCROLL_STEPS` | `5` | Scroll steps during event simulation |
| `ENABLE_HAR` | `true` | Capture network HAR |
| `HAR_CONTENT` | `omit` | HAR content mode for Playwright (`omit`, `embed`, `attach`) |
| `CHROME_PATH` | `/usr/bin/google-chrome` | Chrome/Chromium executable path |
| `OBSCURA_BINARY` | `/home/surya-pt8233/Documents/Tasks/task3/obscura/target/release/obscura` | Path to Obscura binary |
| `OBSCURA_PORT` | `9222` | Obscura CDP port |
| `OBSCURA_WORKERS` | `1` | Obscura worker count |
| `OBSCURA_CDP_URL` | `ws://127.0.0.1:9222/devtools/browser` | Obscura CDP endpoint URL |
| `OBSCURA_STARTUP_TIMEOUT_MS` | `15000` | Obscura startup timeout |
| `DEFAULT_REPORT_CATEGORY` | `Test (1)` | Category used for report defaults in the `report` command |
| `LOG_LEVEL` | `INFO` | `DEBUG`, `INFO`, `WARN`, `ERROR` |
| `LOG_FILE` | `scraper.log` | Log file path (relative to project root) |
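The sketch below shows one way a few of these variables could be read with their documented defaults. It is an illustration only: `envInt`, `envBool`, and `loadConfig` are hypothetical names, and the parsing and `ENGINE` validation are assumptions, not a description of what `config.js` actually does.

```javascript
// Hypothetical helpers; config.js may parse and validate values differently.
function envInt(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

function envBool(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  return raw.toLowerCase() === 'true';
}

function loadConfig(env = process.env) {
  const engine = env.ENGINE ?? 'chrome';
  // Assumption: unknown engines are rejected up front rather than at launch time.
  if (engine !== 'chrome' && engine !== 'obscura') {
    throw new Error(`ENGINE must be "chrome" or "obscura", got "${engine}"`);
  }
  return {
    engine,
    timeoutMs: envInt(env, 'TIMEOUT_MS', 30000),
    waitUntil: env.WAIT_UNTIL ?? 'domcontentloaded',
    maxRetries: envInt(env, 'MAX_RETRIES', 2),
    enableScroll: envBool(env, 'ENABLE_SCROLL', true),
    enableHar: envBool(env, 'ENABLE_HAR', true),
  };
}
```

Passing the environment as an argument keeps the loader easy to exercise with a plain object instead of mutating `process.env`.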
- Sequential execution: one URL at a time for comparable benchmarks between engines.
- Separate output directories: `output/chrome/` and `output/obscura/` keep results isolated.
- Obscura auto-restart: unhealthy processes are restarted up to 3 times; failed workers trigger a recycle during CSV runs.
- Event stats and DOM metrics are stored alongside raw and parsed HTML; HAR is written when enabled.
- Resource sampling uses `pidusage` when available; if PID resolution fails, browser OS stats are skipped.
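
The bounded auto-restart described in the notes above can be sketched as a small loop. This is a hedged illustration, not the logic in `src/core/obscura.js`: `ensureHealthy`, `checkHealth`, and `restart` are hypothetical names, and the caller supplies both callbacks.

```javascript
// Hypothetical sketch; checkHealth and restart stand in for the real lifecycle code.
async function ensureHealthy({ checkHealth, restart, maxRestarts = 3 }) {
  for (let restarts = 0; ; restarts += 1) {
    // Healthy: report how many restarts were needed to get here.
    if (await checkHealth()) return restarts;
    // Still unhealthy after the allowed number of restarts: give up.
    if (restarts >= maxRestarts) {
      throw new Error(`Obscura still unhealthy after ${maxRestarts} restarts`);
    }
    await restart();
  }
}
```

Capping the restarts keeps a persistently broken binary from stalling a CSV run indefinitely; the caller can treat the thrown error as a failed URL and move on.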