Science-Prof-Robot/autoclick

Auto Scraping to CSV

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. No external LLM API is required: Claude acts as the host model (the "brain") that handles page-specific quirks, asks clarifying questions, and manages the full scraping workflow.

Philosophy: Agent-Driven Scraping

Unlike traditional scrapers that require you to write CSS selectors or XPath queries, this tool lets the agent figure it out:

  1. You describe what you want: "Get all product names and prices"
  2. The agent explores the page: scrolls, clicks pagination, handles lazy loading
  3. The agent asks clarifying questions: "I see 3 price fields (retail, sale, member). Which one do you want?"
  4. The agent handles edge cases: infinite scroll, popups, login walls, SPA navigation
  5. You get a CSV: clean, structured, ready to use

A typical exchange:

You: "Scrape Hacker News top stories"
Agent: "I see 30 stories on the page. Do you want:
        A) Just titles and URLs
        B) Titles, URLs, points, and comment counts
        C) All of the above plus author names?"
You: "B"
Agent: [scrapes] → stories.csv (30 rows)

Installation by Platform

Claude Code

# 1. Clone the repo and copy the skill folder into your Claude Code skills directory
git clone https://github.com/Science-Prof-Robot/autoclick.git
mkdir -p ~/.claude/skills ~/.claude/agents   # make sure the target folders exist
cp -r autoclick/.claude/skills/auto-scraping-to-csv ~/.claude/skills/

# 2. Copy the bridge script to the agents folder
cp ~/.claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs ~/.claude/agents/

# 3. Install Playwright dependency
npm install -D playwright
npx playwright install chromium

# 4. Start the bridge (keep running in a terminal)
node ~/.claude/agents/page-agent-bridge.mjs

Use in Claude Code:

/scrape-to-csv https://news.ycombinator.com "Get top 30 stories with titles, points, and comment counts"

Cursor

# 1. Clone the repo
git clone https://github.com/Science-Prof-Robot/autoclick.git

# 2. Copy skill files to Cursor's skills directory
# (assumes Cursor follows the same layout as Claude Code, under ~/.cursor/)
mkdir -p ~/.cursor/skills ~/.cursor/agents   # make sure the target folders exist
cp -r autoclick/.claude/skills/auto-scraping-to-csv ~/.cursor/skills/
cp autoclick/.claude/agents/page-agent-bridge.mjs ~/.cursor/agents/

# 3. Install Playwright
npm install -D playwright
npx playwright install chromium

# 4. Start the bridge
node ~/.cursor/agents/page-agent-bridge.mjs

Use in Cursor:

/scrape-to-csv https://example.com/products "Extract product catalog with prices"

OpenClaw

# 1. Install via ClawHub CLI
clawhub install auto-scraping-to-csv

# 2. Copy the bridge script from the installed skill to the agents folder
mkdir -p .claude/agents   # make sure the target folder exists
cp skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/

# 3. Install Playwright (if not already installed)
npm install -D playwright
npx playwright install chromium

# 4. Start the bridge
node .claude/agents/page-agent-bridge.mjs

Use in OpenClaw:

/scrape-to-csv https://www.anthropic.com/news "Get latest blog posts"

Manual / Any Editor

# 1. Clone
git clone https://github.com/Science-Prof-Robot/autoclick.git
cd autoclick

# 2. Install Playwright
npm install
npx playwright install chromium

# 3. Start bridge
node .claude/agents/page-agent-bridge.mjs

# 4. Use curl or any HTTP client to interact with the bridge
# See API Reference below

How It Works

Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected

  • Text-based DOM: No screenshots, no vision model needed. The agent reads simplified HTML with indexed elements.
  • Host model: Claude is the reasoning engine. No OpenAI/Qwen API key needed.
  • Agent-driven: The agent handles scrolling, pagination, popups, and asks you what to do when ambiguous.
  • CSV export: Built-in workflow to convert scraped structured data to CSV.
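
To make the "simplified HTML with indexed elements" idea concrete, here is a minimal sketch of how elements might be flattened into numbered text lines that a model can read and refer to by index. The function name `toIndexedText`, the element descriptor shape, and the `[index]<tag>text</tag>` line format are all illustrative assumptions; this README does not show the format Page-Agent actually emits.

```javascript
// Hypothetical sketch: flatten element descriptors into indexed text lines.
// The "[index]<tag attrs>text</tag>" format is an assumption for illustration,
// not Page-Agent's documented output.
function toIndexedText(elements) {
  return elements
    .map((el, index) => {
      const attrs = Object.entries(el.attrs ?? {})
        .map(([name, value]) => ` ${name}="${value}"`)
        .join("");
      // Collapse runs of whitespace so each element stays on one line.
      const text = (el.text ?? "").replace(/\s+/g, " ").trim();
      return `[${index}]<${el.tag}${attrs}>${text}</${el.tag}>`;
    })
    .join("\n");
}

const sample = [
  { tag: "a", attrs: { href: "/item?id=1" }, text: "First   story" },
  { tag: "button", text: "More" },
];
console.log(toIndexedText(sample));
```

With a representation along these lines, the host model can refer to "element 1" rather than having to produce a CSS selector.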

Quick Start

Step 1: Start the Bridge

node .claude/agents/page-agent-bridge.mjs

Default port: 9876.

Step 2: Scrape (Agent Handles the Rest)

# Create a session — the agent will explore the page
curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "headless": true}'

The agent will:

  1. Fetch the DOM state
  2. Ask you what data to extract if ambiguous
  3. Handle scrolling/pagination if needed
  4. Extract structured JSON
  5. Convert to CSV
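
The last two steps (structured JSON, then CSV) can be sketched as a small pure function. This is a minimal illustration that assumes the extracted data is an array of flat objects; `rowsToCsv` and `escapeCsvField` are hypothetical names, not functions from the skill itself.

```javascript
// Minimal sketch: convert an array of flat objects into CSV text.
// Quoting follows the usual CSV convention (RFC 4180 style).
function escapeCsvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  // Quote fields containing commas, quotes, or line breaks; double embedded quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function rowsToCsv(rows, columns) {
  // Default to the union of all keys, in first-seen order.
  const header = columns ?? [...new Set(rows.flatMap((row) => Object.keys(row)))];
  const lines = [header, ...rows.map((row) => header.map((key) => row[key]))];
  return lines.map((fields) => fields.map(escapeCsvField).join(",")).join("\r\n") + "\r\n";
}

const stories = [
  { title: 'Show HN: "Tiny" tool, v2', points: 120, comments: 45 },
  { title: "Plain title", points: 7 }, // missing "comments" becomes an empty cell
];
console.log(rowsToCsv(stories));
```

Missing fields are written as empty cells here; substituting a placeholder such as 'N/A' would be a one-line change in `escapeCsvField`.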

Step 3: Get Your CSV

The agent saves output.csv in your working directory.


What the Agent Handles For You

| Complex Scenario | Agent Behavior |
|---|---|
| Infinite scroll | Auto-scrolls, detects "no more content", stops |
| Pagination | Clicks "Next", extracts from all pages, asks how many pages |
| Popups / modals | Detects overlays, dismisses or asks if relevant |
| Lazy loading | Waits for content, retries, times out gracefully |
| Login walls | Detects auth required, asks for credentials or stops |
| Multiple data formats | Asks: "I see prices as '$19.99' and 'USD 19.99'. Which format?" |
| Missing fields | Some items lack prices. Asks: "Skip those rows or fill with 'N/A'?" |
| Tables vs lists | Detects layout, asks: "Table has 5 columns. Which ones do you want?" |
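
As an illustration of the infinite-scroll row, the "scroll until nothing new appears" behavior can be sketched as a loop that stops once the item count stays flat for a few rounds. This is one plausible strategy rather than the agent's actual logic; `scrollUntilStable`, `scrollOnce`, `countItems`, and the default limits are all hypothetical.

```javascript
// Hypothetical sketch: keep scrolling until the item count stops growing.
// `scrollOnce` and `countItems` are caller-supplied async functions (assumptions),
// for example thin wrappers around calls to the bridge.
async function scrollUntilStable({ scrollOnce, countItems, maxScrolls = 50, stableRounds = 2 }) {
  let previous = await countItems();
  let unchanged = 0;
  let scrolls = 0;
  while (scrolls < maxScrolls && unchanged < stableRounds) {
    await scrollOnce();
    scrolls += 1;
    const current = await countItems();
    // Reset the counter whenever new items appear.
    unchanged = current > previous ? 0 : unchanged + 1;
    previous = current;
  }
  return { items: previous, scrolls, reachedEnd: unchanged >= stableRounds };
}

// Simulated page: 10 items at first, 10 more per scroll, capped at 30.
let loaded = 10;
scrollUntilStable({
  scrollOnce: async () => { loaded = Math.min(loaded + 10, 30); },
  countItems: async () => loaded,
}).then((result) => console.log(result));
```

In the simulation the loop stops after four scrolls: two that load new items and two that confirm nothing more arrives. The `maxScrolls` cap guards against pages that never run out of content.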

Natural Language Commands

When the skill is active, just ask:

/scrape-to-csv https://news.ycombinator.com
  "Get top stories with title, URL, points, and comments"

/scrape-table https://example.com/pricing
  "Extract the pricing table"

/scrape-news https://www.anthropic.com/news
  "Latest blog posts with titles, dates, and URLs"

/scrape-products https://amazon.com/s?k=laptops
  "Laptop listings: name, price, rating, Prime eligibility"

The agent will:

  1. Navigate to the page
  2. Explore the DOM
  3. Ask clarifying questions if needed
  4. Extract data
  5. Save as CSV
  6. Show you a preview

API Reference (For Power Users)

Start Session

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "headless": true}'

Get DOM State

curl http://localhost:9876/sessions/SESSION_ID/state

Execute Action

curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d '{"action": "executeJavascript", "params": {"script": "return document.title;"}}'
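
For extraction, the same action can carry a script that returns an array of rows. The sketch below only builds the request; `buildExtractRequest` is a hypothetical helper, the `.titleline > a` selector is an assumption about Hacker News markup, and it presumes the bridge hands back the script's return value, which this README does not spell out. Only the `executeJavascript` action and its `script` parameter come from the example above.

```javascript
// Sketch: build a request for the /act endpoint that extracts rows from the page.
// Only the "executeJavascript" action and its "script" param come from the README.
function buildExtractRequest(baseUrl, sessionId, selector, fields) {
  // `fields` maps output column names to DOM property names, e.g. { title: "textContent" }.
  const script =
    `const fields = ${JSON.stringify(fields)};` +
    `return Array.from(document.querySelectorAll(${JSON.stringify(selector)}))` +
    `.map((el) => Object.fromEntries(Object.entries(fields).map(([col, prop]) => [col, el[prop]])));`;
  return {
    url: `${baseUrl}/sessions/${encodeURIComponent(sessionId)}/act`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ action: "executeJavascript", params: { script } }),
    },
  };
}

const request = buildExtractRequest("http://localhost:9876", "SESSION_ID", ".titleline > a", {
  title: "textContent",
  url: "href",
});
console.log(request.url);
console.log(request.options.body);
```

The returned `url` and `options` can be passed straight to `fetch(request.url, request.options)` once a bridge is running.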

Close Session

curl -X DELETE http://localhost:9876/sessions/SESSION_ID
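
The four calls above fit together as a session lifecycle: create, inspect or act, then close. The sketch below wraps them so the session is closed even if the work in between fails. It assumes the create-session response is JSON containing a `sessionId` field; response bodies are not documented in this README, so `pickSessionId` is exposed for adjustment. `withSession` itself is a hypothetical helper.

```javascript
// Sketch of a session lifecycle against the four endpoints above.
// ASSUMPTION: the create-session response is JSON with a `sessionId` field.
// Adjust `pickSessionId` to match what the bridge actually returns.
async function withSession(url, work, options = {}) {
  const {
    baseUrl = "http://localhost:9876",
    fetchImpl = fetch, // injectable for testing; Node 18+ provides a global fetch
    pickSessionId = (body) => body.sessionId,
  } = options;
  const jsonHeaders = { "Content-Type": "application/json" };

  const created = await fetchImpl(`${baseUrl}/sessions`, {
    method: "POST",
    headers: jsonHeaders,
    body: JSON.stringify({ url, headless: true }),
  });
  if (!created.ok) throw new Error(`Could not create session: HTTP ${created.status}`);
  const sessionId = pickSessionId(await created.json());
  const sessionUrl = `${baseUrl}/sessions/${sessionId}`;

  const api = {
    getState: async () => (await fetchImpl(`${sessionUrl}/state`)).json(),
    act: async (action, params) =>
      (await fetchImpl(`${sessionUrl}/act`, {
        method: "POST",
        headers: jsonHeaders,
        body: JSON.stringify({ action, params }),
      })).json(),
  };

  try {
    return await work(api);
  } finally {
    // Always close the session, even when `work` throws.
    await fetchImpl(sessionUrl, { method: "DELETE" });
  }
}

// Example (requires a running bridge):
// withSession("https://example.com", (api) =>
//   api.act("executeJavascript", { script: "return document.title;" })
// ).then(console.log);
```

Passing `fetchImpl` explicitly keeps the helper testable without a live bridge.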

Files

| File | Description |
|---|---|
| .claude/skills/auto-scraping-to-csv/SKILL.md | Skill definition with full instructions |
| .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs | Bridge script (bundled with skill) |
| .claude/agents/page-agent-bridge.mjs | Bridge script (copy here to run) |

ClawHub

Published as a ClawHub skill: auto-scraping-to-csv

clawhub install auto-scraping-to-csv

License

MIT

About

Natural-language scraping tool for Claude Code, OpenClaw, and Cursor
