J-SURYA/scraper

Web Scraper

Unified web scraper supporting Chrome (Playwright-managed Chromium) and Obscura browser engines from a single codebase.

Directory Structure

scraper/
├── config.js              # Unified configuration
├── main.js                # Entry point (CLI)
├── src/
│   ├── cli/
│   │   └── app.js         # Commander CLI
│   ├── core/
│   │   ├── engine.js      # Browser factory
│   │   ├── generate_report.js # Chrome vs Obscura report HTML generator
│   │   ├── obscura.js     # Obscura binary lifecycle
│   │   ├── renderer.js    # Unified page renderer
│   │   ├── parser.js      # HTML parser (script/style stripping)
│   │   ├── scraper.js     # Orchestrator (sequential)
│   │   ├── storage.js     # File persistence + report writer
│   │   ├── stats.js       # Performance statistics
│   │   └── csv_reader.js  # CSV URL extractor
│   └── utils/
│       └── logger.js      # Structured logger
└── output/
    ├── chrome/            # Chrome engine results
    │   └── <category>/
    │       ├── <urlId>/
    │       │   ├── raw.html
    │       │   ├── parsed.html
    │       │   ├── event_stats.json
    │       │   ├── dom_metrics.json
    │       │   └── network.har
    │       ├── report.json
    │       └── resource_samples.json
    └── obscura/           # Obscura engine results
        └── <category>/
            └── ...

Setup

npm install

Optional: install Chrome and Obscura, or point the scraper at existing binaries with the CHROME_PATH and OBSCURA_BINARY environment variables (see Environment Variables).

Usage

Engine selection

# Chrome (default)
ENGINE=chrome node main.js single https://example.com

# Obscura
ENGINE=obscura node main.js single https://example.com

CLI commands

# Scrape a single URL
node main.js single <url>

# Scrape all URLs from a CSV file
node main.js csv <path/to/urls.csv>

# Generate a comparison report (HTML)
node main.js report \
    --chrome-report <path/to/chrome/report.json> \
    --chrome-samples <path/to/chrome/resource_samples.json> \
    --obscura-report <path/to/obscura/report.json> \
    --obscura-samples <path/to/obscura/resource_samples.json> \
    --out <path/to/comparison.html>

npm scripts

npm run start
npm run single -- https://example.com
npm run csv -- ./data/Test.csv

CSV format: any file with a url column header, or a single column of URLs. Non-URL cells are ignored; duplicates are removed.
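
The CSV rules above can be sketched roughly as follows. This is a minimal illustration, not the code in csv_reader.js: extractUrls is a hypothetical name, the header match is assumed to be case-insensitive, only http and https URLs are assumed to count, and the comma splitting is naive (quoted fields containing commas are not handled).

```javascript
// Hypothetical helper for illustration; the project's csv_reader.js may differ.
function extractUrls(csvText) {
  // Split into rows of trimmed cells and drop fully empty rows.
  const rows = csvText
    .split(/\r?\n/)
    .map((line) => line.split(',').map((cell) => cell.trim()))
    .filter((cells) => cells.some((cell) => cell !== ''));
  if (rows.length === 0) return [];

  // With a "url" header, read only that column; otherwise scan every cell.
  const urlColumn = rows[0].findIndex((cell) => cell.toLowerCase() === 'url');
  const candidates =
    urlColumn >= 0
      ? rows.slice(1).map((cells) => cells[urlColumn] ?? '')
      : rows.flat();

  // Assumption: only http(s) URLs are scraped.
  const isHttpUrl = (value) => {
    try {
      const { protocol } = new URL(value);
      return protocol === 'http:' || protocol === 'https:';
    } catch {
      return false; // not parseable as a URL, so the cell is ignored
    }
  };

  // A Set drops duplicates while keeping first-seen order.
  return [...new Set(candidates.filter(isHttpUrl))];
}
```

Calling extractUrls on a file whose rows repeat a URL or contain plain text returns each valid URL once, in the order it first appears.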

Output

  • Each run writes to output/<engine>/<category>/ where category is single or the CSV filename (without extension).
  • The category directory is cleared before each run.
  • URL folders are named by an 8-character MD5 hash of the URL.
  • report.json includes per-URL results and aggregate stats for the run.
  • resource_samples.json stores periodic CPU/RAM samples for Node and (when available) the browser process tree.
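
As a rough sketch of the layout described above, the per-URL output directory could be derived like this. The function names are hypothetical, and the choice of the first 8 hex characters of the MD5 digest is an assumption; the README only states that the folder name is an 8-character MD5 hash.

```javascript
const crypto = require('node:crypto');
const path = require('node:path');

// Assumption: the folder name is the first 8 hex digits of the MD5 digest.
function urlId(url) {
  return crypto.createHash('md5').update(url).digest('hex').slice(0, 8);
}

// "single" for one-off runs, otherwise the CSV filename without its extension.
function categoryFor(csvPath) {
  return csvPath ? path.basename(csvPath, path.extname(csvPath)) : 'single';
}

// output/<engine>/<category>/<urlId>
function urlOutputDir(engine, url, csvPath) {
  return path.join('output', engine, categoryFor(csvPath), urlId(url));
}
```

For example, categoryFor('./data/Test.csv') yields Test, so a CSV run with ENGINE=obscura would write under output/obscura/Test/.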

Report Generation

The report command builds an HTML comparison using Chrome and Obscura reports plus their resource samples. If any flag is omitted, defaults come from config.js (see REPORT_PATHS and DEFAULT_REPORT_CATEGORY).
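
The fallback behaviour might look something like the sketch below. The shape of REPORT_PATHS is not documented here, so defaultReportPaths, resolveReportOptions, the option keys, and the default location of the HTML output are all hypothetical placeholders.

```javascript
// Hypothetical defaults derived from the documented output layout.
// The real REPORT_PATHS in config.js may be structured differently.
function defaultReportPaths(category) {
  return {
    chromeReport: `output/chrome/${category}/report.json`,
    chromeSamples: `output/chrome/${category}/resource_samples.json`,
    obscuraReport: `output/obscura/${category}/report.json`,
    obscuraSamples: `output/obscura/${category}/resource_samples.json`,
    out: `output/${category}_comparison.html`, // placeholder location
  };
}

// Flags that were passed win; omitted (undefined) flags fall back to defaults.
function resolveReportOptions(flags, category) {
  const provided = Object.fromEntries(
    Object.entries(flags).filter(([, value]) => value !== undefined)
  );
  return { ...defaultReportPaths(category), ...provided };
}
```

With this approach, passing only --chrome-report overrides that one path and leaves the other four at their category-based defaults.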

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| ENGINE | chrome | Browser engine: chrome or obscura |
| TIMEOUT_MS | 30000 | Timeout for browser operations (context, page, content) |
| WAIT_UNTIL | domcontentloaded | Playwright waitUntil strategy |
| WAIT_AFTER_LOAD_MS | 2000 | Wait after page load |
| SETTLE_MS | 500 | Extra wait before capturing content (fixed in config) |
| MAX_RETRIES | 2 | Retry attempts per URL |
| MIN_CONTENT_LENGTH | 100 | Minimum HTML length for successful capture |
| ENABLE_SCROLL | true | Scroll page after load |
| SCROLL_STEPS | 8 | PageDown steps during scroll |
| SCROLL_STEP_DELAY_MS | 200 | Delay between scroll steps |
| ENABLE_EVENT_SIM | true | Simulate user interactions |
| EVENT_MOUSE_MOVES | 5 | Mouse move events to simulate |
| EVENT_CLICK_COUNT | 3 | Click interactions per selector bucket |
| EVENT_KEYSTROKES | 8 | Keystrokes to type in the first input |
| EVENT_SCROLL_STEPS | 5 | Scroll steps during event simulation |
| ENABLE_HAR | true | Capture network HAR |
| HAR_CONTENT | omit | HAR content mode for Playwright (omit, embed, attach) |
| CHROME_PATH | /usr/bin/google-chrome | Chrome/Chromium executable path |
| OBSCURA_BINARY | /home/surya-pt8233/Documents/Tasks/task3/obscura/target/release/obscura | Path to Obscura binary |
| OBSCURA_PORT | 9222 | Obscura CDP port |
| OBSCURA_WORKERS | 1 | Obscura worker count |
| OBSCURA_CDP_URL | ws://127.0.0.1:9222/devtools/browser | Obscura CDP endpoint URL |
| OBSCURA_STARTUP_TIMEOUT_MS | 15000 | Obscura startup timeout |
| DEFAULT_REPORT_CATEGORY | Test (1) | Category used for report defaults in report command |
| LOG_LEVEL | INFO | DEBUG, INFO, WARN, ERROR |
| LOG_FILE | scraper.log | Log file path (relative to project root) |
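
To illustrate how a few of these variables could be read with their documented defaults, here is a minimal sketch. It is not the project's config.js: intFromEnv, boolFromEnv, loadConfig, and the returned key names are hypothetical, and the set of strings treated as true is an assumption.

```javascript
// Hypothetical helpers; the real config.js may parse values differently.
function intFromEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Assumption: "1", "true" and "yes" (any case) mean true; anything else is false.
function boolFromEnv(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  return ['1', 'true', 'yes'].includes(raw.toLowerCase());
}

function loadConfig(env = process.env) {
  const engine = (env.ENGINE ?? 'chrome').toLowerCase();
  if (!['chrome', 'obscura'].includes(engine)) {
    throw new Error(`Unsupported ENGINE "${engine}"; expected chrome or obscura`);
  }
  return {
    engine,
    timeoutMs: intFromEnv(env, 'TIMEOUT_MS', 30000),
    waitUntil: env.WAIT_UNTIL ?? 'domcontentloaded',
    maxRetries: intFromEnv(env, 'MAX_RETRIES', 2),
    enableScroll: boolFromEnv(env, 'ENABLE_SCROLL', true),
    enableHar: boolFromEnv(env, 'ENABLE_HAR', true),
    obscuraPort: intFromEnv(env, 'OBSCURA_PORT', 9222),
  };
}
```

Passing the environment as a parameter keeps the function easy to test; in normal use loadConfig() reads process.env, so ENGINE=obscura TIMEOUT_MS=5000 node main.js single https://example.com would select Obscura with a 5 second timeout.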

Notes

  • Sequential execution: one URL at a time for comparable benchmarks between engines.
  • Separate output directories: output/chrome/ and output/obscura/ keep results isolated.
  • Obscura auto-restart: unhealthy processes are restarted up to 3 times; failed workers trigger a recycle during CSV runs.
  • Event stats and DOM metrics are stored alongside raw and parsed HTML; HAR is written when enabled.
  • Resource sampling uses pidusage when available; if PID resolution fails, browser OS stats are skipped.
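
The sequential, retry-per-URL behaviour noted above can be sketched as a plain loop. This is an illustration only, not the code in scraper.js: scrapeSequentially and renderPage are hypothetical names, and it assumes MAX_RETRIES counts retries after the first attempt and that a capture shorter than MIN_CONTENT_LENGTH counts as a failure.

```javascript
// renderPage is a hypothetical async function (url) => html supplied by the caller.
async function scrapeSequentially(urls, renderPage, options = {}) {
  const { maxRetries = 2, minContentLength = 100 } = options;
  const results = [];

  // One URL at a time, so engine timings are not skewed by concurrency.
  for (const url of urls) {
    const result = { url, ok: false, attempts: 0, error: null };

    // Assumption: maxRetries is the number of retries after the first attempt.
    for (let attempt = 1; attempt <= maxRetries + 1; attempt += 1) {
      result.attempts = attempt;
      try {
        const html = await renderPage(url);
        if (html.length < minContentLength) {
          throw new Error(`content too short: ${html.length} characters`);
        }
        result.ok = true;
        result.error = null;
        break; // success, stop retrying this URL
      } catch (err) {
        result.error = err.message; // keep the last failure reason
      }
    }
    results.push(result);
  }
  return results;
}
```

Because each URL is awaited before the next one starts, a slow or failing page delays the rest of the run, which is the trade-off accepted in exchange for comparable per-URL measurements.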

About

Chrome vs Obscura - A Dual Browser Engine Scraping System
