Web Scraper

Unified web scraper supporting Chrome (Playwright-managed Chromium) and Obscura browser engines from a single codebase.

Directory Structure

scraper/
├── config.js              # Unified configuration
├── main.js                # Entry point (CLI)
├── src/
│   ├── cli/
│   │   └── app.js         # Commander CLI
│   ├── core/
│   │   ├── engine.js      # Browser factory
│   │   ├── generate_report.js # Chrome vs Obscura report HTML generator
│   │   ├── obscura.js     # Obscura binary lifecycle
│   │   ├── renderer.js    # Unified page renderer
│   │   ├── parser.js      # HTML parser (script/style stripping)
│   │   ├── scraper.js     # Orchestrator (sequential)
│   │   ├── storage.js     # File persistence + report writer
│   │   ├── stats.js       # Performance statistics
│   │   └── csv_reader.js  # CSV URL extractor
│   └── utils/
│       └── logger.js      # Structured logger
└── output/
    ├── chrome/            # Chrome engine results
    │   └── <category>/
    │       ├── <urlId>/
    │       │   ├── raw.html
    │       │   ├── parsed.html
    │       │   ├── event_stats.json
    │       │   ├── dom_metrics.json
    │       │   └── network.har
    │       ├── report.json
    │       └── resource_samples.json
    └── obscura/           # Obscura engine results
        └── <category>/
            └── ...

Setup

npm install

Optional: install Chrome and Obscura, or point to existing binaries with the CHROME_PATH and OBSCURA_BINARY environment variables (see Environment Variables below).

Usage

Engine selection

# Chrome (default)
ENGINE=chrome node main.js single https://example.com

# Obscura
ENGINE=obscura node main.js single https://example.com

CLI commands

# Scrape a single URL
node main.js single <url>

# Scrape all URLs from a CSV file
node main.js csv <path/to/urls.csv>

# Generate a comparison report (HTML)
node main.js report \
    --chrome-report <path/to/chrome/report.json> \
    --chrome-samples <path/to/chrome/resource_samples.json> \
    --obscura-report <path/to/obscura/report.json> \
    --obscura-samples <path/to/obscura/resource_samples.json> \
    --out <path/to/comparison.html>

npm scripts

npm run start
npm run single -- https://example.com
npm run csv -- ./data/Test.csv

CSV format: any file with a url column header, or a single column of URLs. Non-URL cells are ignored; duplicates are removed.
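
The CSV rules above can be sketched roughly as follows. This is an illustration only, not the code in src/core/csv_reader.js: the names `isHttpUrl` and `extractUrls` are hypothetical, "URL" is assumed to mean an absolute http(s) URL, and the naive comma split does not handle quoted fields.

```javascript
// Hypothetical sketch of the CSV rules described above; not the project's csv_reader.js.

// Assumption: a cell counts as a URL only if it parses as an absolute http(s) URL.
function isHttpUrl(cell) {
  try {
    const { protocol } = new URL(cell);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // new URL() throws on anything that is not an absolute URL
  }
}

function extractUrls(csvText) {
  const rows = csvText
    .split(/\r?\n/)
    .map((line) => line.split(",").map((cell) => cell.trim())) // naive split, no quoting support
    .filter((cells) => cells.some((cell) => cell !== ""));
  if (rows.length === 0) return [];

  // Use the "url" column when the first row has one; otherwise treat the
  // file as headerless and read the first column.
  const header = rows[0].map((cell) => cell.toLowerCase());
  const urlIndex = header.indexOf("url");
  const column = urlIndex >= 0 ? urlIndex : 0;
  const dataRows = urlIndex >= 0 ? rows.slice(1) : rows;

  // A Set drops duplicates while keeping first-seen order.
  const seen = new Set();
  for (const cells of dataRows) {
    const cell = cells[column] ?? "";
    if (isHttpUrl(cell)) seen.add(cell);
  }
  return [...seen];
}
```

Under these assumptions, a file with an `id,url` header and a headerless single-column file both reduce to a deduplicated list of URLs in first-seen order.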

Output

Each run writes to output/<engine>/<category>/ where category is single or the CSV filename (without extension).
The category directory is cleared before each run.
URL folders are named by an 8-character MD5 hash of the URL.
report.json includes per-URL results and aggregate stats for the run.
resource_samples.json stores periodic CPU/RAM samples for Node and (when available) the browser process tree.
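
The output layout described above could be derived along these lines. This is a minimal sketch, not the code in src/core/storage.js: the helper names `urlId`, `categoryFor`, and `urlOutputDir` are hypothetical, and it assumes the 8-character ID is the first 8 hex digits of the MD5 digest.

```javascript
const crypto = require("crypto");
const path = require("path");

// Assumption: the folder name is the first 8 hex characters of the URL's MD5 digest.
function urlId(url) {
  return crypto.createHash("md5").update(url).digest("hex").slice(0, 8);
}

// Hypothetical helper: "single" for one-off runs, otherwise the CSV
// filename without its extension.
function categoryFor(csvPath) {
  return csvPath ? path.basename(csvPath, path.extname(csvPath)) : "single";
}

// Builds output/<engine>/<category>/<urlId> as a relative path.
function urlOutputDir(engine, url, csvPath) {
  return path.join("output", engine, categoryFor(csvPath), urlId(url));
}
```

With these helpers, a run started from ./data/Test.csv would use the category `Test`, and a single-URL run would use `single`.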

Report Generation

The report command builds an HTML comparison using Chrome and Obscura reports plus their resource samples. If any flag is omitted, defaults come from config.js (see REPORT_PATHS and DEFAULT_REPORT_CATEGORY).

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `ENGINE` | `chrome` | Browser engine: `chrome` or `obscura` |
| `TIMEOUT_MS` | `30000` | Timeout for browser operations (context, page, content) |
| `WAIT_UNTIL` | `domcontentloaded` | Playwright `waitUntil` strategy |
| `WAIT_AFTER_LOAD_MS` | `2000` | Wait after page load |
| `SETTLE_MS` | `500` | Extra wait before capturing content (fixed in config) |
| `MAX_RETRIES` | `2` | Retry attempts per URL |
| `MIN_CONTENT_LENGTH` | `100` | Minimum HTML length for successful capture |
| `ENABLE_SCROLL` | `true` | Scroll page after load |
| `SCROLL_STEPS` | `8` | PageDown steps during scroll |
| `SCROLL_STEP_DELAY_MS` | `200` | Delay between scroll steps |
| `ENABLE_EVENT_SIM` | `true` | Simulate user interactions |
| `EVENT_MOUSE_MOVES` | `5` | Mouse move events to simulate |
| `EVENT_CLICK_COUNT` | `3` | Click interactions per selector bucket |
| `EVENT_KEYSTROKES` | `8` | Keystrokes to type in the first input |
| `EVENT_SCROLL_STEPS` | `5` | Scroll steps during event simulation |
| `ENABLE_HAR` | `true` | Capture network HAR |
| `HAR_CONTENT` | `omit` | HAR content mode for Playwright (`omit`, `embed`, `attach`) |
| `CHROME_PATH` | `/usr/bin/google-chrome` | Chrome/Chromium executable path |
| `OBSCURA_BINARY` | `/home/surya-pt8233/Documents/Tasks/task3/obscura/target/release/obscura` | Path to Obscura binary |
| `OBSCURA_PORT` | `9222` | Obscura CDP port |
| `OBSCURA_WORKERS` | `1` | Obscura worker count |
| `OBSCURA_CDP_URL` | `ws://127.0.0.1:9222/devtools/browser` | Obscura CDP endpoint URL |
| `OBSCURA_STARTUP_TIMEOUT_MS` | `15000` | Obscura startup timeout |
| `DEFAULT_REPORT_CATEGORY` | `Test (1)` | Category used for report defaults in `report` command |
| `LOG_LEVEL` | `INFO` | `DEBUG`, `INFO`, `WARN`, `ERROR` |
| `LOG_FILE` | `scraper.log` | Log file path (relative to project root) |
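
As an illustration of how a few of these variables might be read with their documented defaults, here is a minimal sketch. It is not the project's config.js: the helpers `envInt`, `envBool`, and `loadConfig` are hypothetical, and the accepted boolean spellings (`1`, `true`, `yes`) are an assumption.

```javascript
// Hypothetical helpers; the real config.js may parse values differently.

// Reads an integer variable, falling back when it is unset or not a number.
function envInt(name, fallback, env = process.env) {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Reads a boolean variable. Assumption: "1", "true", and "yes" (any case)
// mean true; any other non-empty value means false.
function envBool(name, fallback, env = process.env) {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  return ["1", "true", "yes"].includes(raw.toLowerCase());
}

// Covers only a subset of the table above, with the defaults listed there.
function loadConfig(env = process.env) {
  return {
    engine: env.ENGINE || "chrome",
    timeoutMs: envInt("TIMEOUT_MS", 30000, env),
    waitUntil: env.WAIT_UNTIL || "domcontentloaded",
    maxRetries: envInt("MAX_RETRIES", 2, env),
    enableScroll: envBool("ENABLE_SCROLL", true, env),
    enableHar: envBool("ENABLE_HAR", true, env),
    obscuraPort: envInt("OBSCURA_PORT", 9222, env),
  };
}
```

Passing the environment as a parameter keeps the sketch easy to test; variables are overridden per run in the same way as ENGINE, for example `TIMEOUT_MS=60000 ENABLE_HAR=false node main.js single https://example.com`.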

Notes

Sequential execution: one URL at a time for comparable benchmarks between engines.
Separate output directories: output/chrome/ and output/obscura/ keep results isolated.
Obscura auto-restart: unhealthy processes are restarted up to 3 times; failed workers trigger a recycle during CSV runs.
Event stats and DOM metrics are stored alongside raw and parsed HTML; HAR is written when enabled.
Resource sampling uses pidusage when available; if PID resolution fails, browser OS stats are skipped.
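
The bounded auto-restart behavior noted above can be pictured with a sketch like the following. It is an assumption-laden illustration, not the code in src/core/obscura.js: `ensureHealthy` and its `isHealthy` and `restart` callbacks are hypothetical names, and only the limit of 3 restarts comes from the notes.

```javascript
// Hypothetical sketch of a bounded restart policy; not the project's obscura.js.
// isHealthy and restart are caller-supplied callbacks (sync or async).
async function ensureHealthy({ isHealthy, restart, maxRestarts = 3 }) {
  let restarts = 0;
  while (!(await isHealthy())) {
    if (restarts >= maxRestarts) {
      throw new Error(`process still unhealthy after ${maxRestarts} restarts`);
    }
    restarts += 1;
    await restart();
  }
  return restarts; // number of restarts that were needed
}
```

Capping the loop keeps a persistently failing binary from stalling a long CSV run; the caller can catch the error and mark the current URL as failed.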
