From ce1a4962d5b201dba2c20a32d39cfd4b4796bd0e Mon Sep 17 00:00:00 2001 From: meirk-brd Date: Thu, 30 Apr 2026 19:08:46 +0300 Subject: [PATCH] feat(brightdata-scrape): add Bright Data web scraping power MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a Kiro power that detects a project's stack and adds production-ready web scraping in the right shape — a reusable module, an API route, or an agent tool — backed by Bright Data's Web Unlocker, SERP API, Web Scraper API, and Browser API. Also wires the Bright Data MCP server into the project so any AI agent that runs against the project (Claude Code, Cursor, Cline, Kiro itself) gains live web tools. The power runs a four-phase orchestrated workflow with confirmation gates between phases: 1. Detect & plan — inspect manifest (package.json, pyproject.toml, requirements.txt, go.mod, Cargo.toml, etc.), classify the stack, pick a single integration pattern (module / route / tool), and propose a plan. 2. Scraping playbook — pick the right Bright Data API and selectors: pre-built scraper if one exists for the domain, Web Unlocker for static HTML, Browser API for JS-rendered or interactive content, SERP API for search results. 3. Integrate — fill the right code template and write generated files into the user's project (with a confirmation gate before any file is written). 4. MCP & verify — wire the Bright Data MCP server into .kiro/settings/mcp.json, run a one-page smoke test, and write a README wrap-up section. First-class language support: Python and TypeScript/JavaScript. Other languages get a generic Web Unlocker curl/HTTP template that the user adapts. Frameworks covered by canonical templates: - Web (route pattern): Next.js (App Router + Pages Router), Express, Fastify, Hono, Koa, FastAPI, Flask, Django. - Agent (tool pattern): LangChain (TS + Python), Anthropic SDK (TS + Python), OpenAI SDK (TS + Python), Mastra, Vercel AI SDK. - Module: bs4 + stdlib (Python), cheerio + fetch-only (TypeScript). Files added: - brightdata-scrape/POWER.md — frontmatter + onboarding + orchestrator pointer. - brightdata-scrape/mcp.json — wires the remote Bright Data MCP server with the API token interpolated from BRIGHTDATA_API_KEY. - brightdata-scrape/steering/scrape-workflow.md — orchestrator. - brightdata-scrape/steering/phase{1..4}-*.md — the four phase steering files, loaded on demand via readSteering. - brightdata-scrape/templates/{module,route,tool,fallback}/* — 22 canonical code templates with {{TARGET_NAME}} / {{TARGET_URL}} / {{SELECTORS}} / etc. placeholders the orchestrator fills at runtime. - README.md — power entry inserted alphabetically. 
--- README.md | 7 + brightdata-scrape/POWER.md | 85 ++++++++++ brightdata-scrape/mcp.json | 8 + .../steering/phase1-detect-and-plan.md | 160 ++++++++++++++++++ .../steering/phase2-scraping-playbook.md | 85 ++++++++++ .../steering/phase3-integrate.md | 134 +++++++++++++++ .../steering/phase4-mcp-and-verify.md | 113 +++++++++++++ brightdata-scrape/steering/scrape-workflow.md | 65 +++++++ brightdata-scrape/templates/fallback/curl.sh | 27 +++ brightdata-scrape/templates/module/py-bs4.py | 100 +++++++++++ .../templates/module/py-stdlib.py | 140 +++++++++++++++ .../templates/module/ts-cheerio.ts | 80 +++++++++ .../templates/module/ts-fetch.ts | 91 ++++++++++ brightdata-scrape/templates/route/django.py | 32 ++++ brightdata-scrape/templates/route/express.ts | 27 +++ brightdata-scrape/templates/route/fastapi.py | 26 +++ brightdata-scrape/templates/route/fastify.ts | 35 ++++ brightdata-scrape/templates/route/flask.py | 27 +++ brightdata-scrape/templates/route/hono.ts | 29 ++++ brightdata-scrape/templates/route/koa.ts | 33 ++++ .../templates/route/next-app-router.ts | 23 +++ .../templates/route/next-pages-router.ts | 36 ++++ .../templates/tool/anthropic-sdk-py.py | 49 ++++++ .../templates/tool/anthropic-sdk-ts.ts | 43 +++++ .../templates/tool/langchain-py.py | 44 +++++ .../templates/tool/langchain-ts.ts | 32 ++++ brightdata-scrape/templates/tool/mastra.ts | 37 ++++ brightdata-scrape/templates/tool/openai-py.py | 53 ++++++ brightdata-scrape/templates/tool/openai-ts.ts | 46 +++++ .../templates/tool/vercel-ai-sdk.ts | 36 ++++ 30 files changed, 1703 insertions(+) create mode 100644 brightdata-scrape/POWER.md create mode 100644 brightdata-scrape/mcp.json create mode 100644 brightdata-scrape/steering/phase1-detect-and-plan.md create mode 100644 brightdata-scrape/steering/phase2-scraping-playbook.md create mode 100644 brightdata-scrape/steering/phase3-integrate.md create mode 100644 brightdata-scrape/steering/phase4-mcp-and-verify.md create mode 100644 brightdata-scrape/steering/scrape-workflow.md create mode 100644 brightdata-scrape/templates/fallback/curl.sh create mode 100644 brightdata-scrape/templates/module/py-bs4.py create mode 100644 brightdata-scrape/templates/module/py-stdlib.py create mode 100644 brightdata-scrape/templates/module/ts-cheerio.ts create mode 100644 brightdata-scrape/templates/module/ts-fetch.ts create mode 100644 brightdata-scrape/templates/route/django.py create mode 100644 brightdata-scrape/templates/route/express.ts create mode 100644 brightdata-scrape/templates/route/fastapi.py create mode 100644 brightdata-scrape/templates/route/fastify.ts create mode 100644 brightdata-scrape/templates/route/flask.py create mode 100644 brightdata-scrape/templates/route/hono.ts create mode 100644 brightdata-scrape/templates/route/koa.ts create mode 100644 brightdata-scrape/templates/route/next-app-router.ts create mode 100644 brightdata-scrape/templates/route/next-pages-router.ts create mode 100644 brightdata-scrape/templates/tool/anthropic-sdk-py.py create mode 100644 brightdata-scrape/templates/tool/anthropic-sdk-ts.ts create mode 100644 brightdata-scrape/templates/tool/langchain-py.py create mode 100644 brightdata-scrape/templates/tool/langchain-ts.ts create mode 100644 brightdata-scrape/templates/tool/mastra.ts create mode 100644 brightdata-scrape/templates/tool/openai-py.py create mode 100644 brightdata-scrape/templates/tool/openai-ts.ts create mode 100644 brightdata-scrape/templates/tool/vercel-ai-sdk.ts diff --git a/README.md b/README.md index 3ebb744..199aa91 100644 --- a/README.md +++ 
b/README.md @@ -56,6 +56,13 @@ Documentation is available at https://kiro.dev/docs/powers/ --- +### brightdata-scrape +**Add web scraping to any app with Bright Data** - Detects your project's stack (Python, TypeScript, 9 web frameworks, 8 agent frameworks) and adds production-ready Bright Data web scraping in the right shape — a reusable module, an API route, or an agent tool — backed by Web Unlocker, SERP API, Web Scraper API, and Browser API. Wires the Bright Data MCP server into the project so any AI agent that runs against the project (Claude Code, Cursor, Cline, Kiro itself) gains live web tools. + +**MCP Servers:** brightdata + +--- + ### cloud-architect **Build infrastructure on AWS** - Build AWS infrastructure with CDK in Python following AWS Well-Architected framework best practices. diff --git a/brightdata-scrape/POWER.md b/brightdata-scrape/POWER.md new file mode 100644 index 0000000..a13d202 --- /dev/null +++ b/brightdata-scrape/POWER.md @@ -0,0 +1,85 @@ +--- +name: "brightdata-scrape" +displayName: "Add web scraping to any app with Bright Data" +description: "Detect your project's stack and add production-ready web scraping — generates the right integration pattern (module, API route, or agent tool), wires Bright Data MCP into your project, handles pagination and bot detection." +keywords: ["scrape", "scraping", "scraper", "crawl", "crawler", "web-data", "extract", "extract-data", "competitor", "pricing-monitor", "lead-generation", "amazon", "linkedin", "instagram", "tiktok", "youtube", "serp", "google-search", "search-engine", "brightdata", "bright-data", "web-unlocker", "browser-api", "captcha", "bot-detection", "pagination", "agent-tools", "mcp"] +author: "Bright Data" +--- + +# Add web scraping to any app with Bright Data + +This power detects what kind of project you're working on and adds production-ready scraping in the right shape — a reusable module, an API route, or an agent tool — backed by Bright Data's Web Unlocker, Web Scraper API, Browser API, and SERP API. It also wires the Bright Data MCP server into your project so any AI agent that runs against the project (Claude Code, Cursor, Cline, Kiro itself) gains live web tools. + +**It works with any language**, but Python and TypeScript/JavaScript get first-class code generation. Other languages get a generic `curl`/HTTP template that you adapt. + +## What you can do + +- "Scrape competitor prices from example.com daily into my Next.js dashboard" +- "Add a `/api/scrape` route to my Express app" +- "Give my Claude SDK agent a tool that searches Google and reads results" +- "Extract all product listings from this Shopify store" +- "Monitor LinkedIn profiles of my pipeline contacts" + +## Onboarding + +Before using this power, complete the following. + +### Step 1: Get a Bright Data API token + +Sign up (or log in) at [brightdata.com](https://brightdata.com). Generate an API token at [API token settings](https://brightdata.com/cp/setting/users). The free tier includes **5,000 requests per month**, including Pro tools. + +### Step 2: Configure the token (pick one) + +**Option A — Env var (recommended for CI / production):** + +```bash +export BRIGHTDATA_API_KEY=<your-api-token> +``` + +Add to your shell profile (`.zshrc`, `.bashrc`) or a project `.env` file. The generated `mcp.json` references `${BRIGHTDATA_API_KEY}`, so this works automatically.
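To sanity-check the token before going further, you can call the Web Unlocker request endpoint directly. This is the same call Phase 2 uses for reconnaissance; the sketch below assumes the token is exported and that your zone is still the account default `unblocker` (see Step 3 below), with `https://example.com` standing in as a harmless test target:

```bash
# Should print raw HTML; an authorization error here means the token (or zone name) is misconfigured.
curl -s -X POST https://api.brightdata.com/request \
  -H "Authorization: Bearer $BRIGHTDATA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"zone": "'"${BRIGHTDATA_UNLOCKER_ZONE:-unblocker}"'", "url": "https://example.com", "format": "raw"}'
```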
+ +**Option B — Hardcoded in user-level Kiro config:** + +Edit `~/.kiro/settings/mcp.json` and add: + +```json +{ + "mcpServers": { + "brightdata": { + "url": "https://mcp.brightdata.com/mcp?token=YOUR_TOKEN_HERE", + "disabled": false + } + } +} +``` + +This makes Bright Data MCP available in every Kiro project on your machine. + +### Step 3 (optional): Set up an Unlocker zone + +The default Web Unlocker zone is named `unblocker` on new accounts. If you've renamed it or hit "no zone" errors, set: + +```bash +export BRIGHTDATA_UNLOCKER_ZONE=<your-zone-name> +``` + +Or create or rename a zone on the [zone management page](https://brightdata.com/cp/zones). + +--- + +## How to use this power + +For any scraping task, **always** read the orchestrator steering file first: + +``` +Call action "readSteering" with powerName="brightdata-scrape", steeringFile="scrape-workflow.md" +``` + +The orchestrator runs four phases in sequence with confirmation gates between each: + +1. **Detect & plan** — inspect the project, pick the right integration pattern, ask what to scrape. +2. **Scraping playbook** — pick the right Bright Data API and selectors based on the target site. +3. **Integrate** — generate the scraper module, API route, or agent tool into the user's project. +4. **MCP & verify** — wire the Bright Data MCP server, run a smoke test, write a README snippet. + +**Do NOT improvise. Do NOT skip phases.** The steering files contain the exact instructions for each. diff --git a/brightdata-scrape/mcp.json b/brightdata-scrape/mcp.json new file mode 100644 index 0000000..b42f8da --- /dev/null +++ b/brightdata-scrape/mcp.json @@ -0,0 +1,8 @@ +{ + "mcpServers": { + "brightdata": { + "type": "http", + "url": "https://mcp.brightdata.com/mcp?token=${BRIGHTDATA_API_KEY}" + } + } +} diff --git a/brightdata-scrape/steering/phase1-detect-and-plan.md b/brightdata-scrape/steering/phase1-detect-and-plan.md new file mode 100644 index 0000000..80302bb --- /dev/null +++ b/brightdata-scrape/steering/phase1-detect-and-plan.md @@ -0,0 +1,160 @@ +# Phase 1: Detect & Plan + +Inspect the user's workspace, classify it, and propose a single integration pattern. Confirm with the user before proceeding to Phase 2. + +## Step 1: Manifest detection + +Look for these manifest files at the workspace root, in order: + +| Manifest | Language | +|----------|----------| +| `package.json` | TypeScript / JavaScript | +| `pyproject.toml`, `requirements.txt`, `Pipfile` | Python | +| `go.mod` | Go | +| `Cargo.toml` | Rust | +| `Gemfile` | Ruby | +| `pom.xml`, `build.gradle` | Java / Kotlin | +| `composer.json` | PHP | + +**No manifest found** → **greenfield mode**. Skip to Step 4 and ask the user: "Python or TypeScript?" + +**Multiple manifests in subdirectories (monorepo)** → ask the user which sub-project to integrate into. Do **not** pick automatically.
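Step 1 can be executed as a single directory listing. A minimal sketch, assuming a POSIX shell run at the workspace root, covering exactly the manifests in the table above:

```bash
# Prints whichever manifest files exist at the workspace root; no output means greenfield mode.
ls package.json pyproject.toml requirements.txt Pipfile go.mod Cargo.toml Gemfile pom.xml build.gradle composer.json 2>/dev/null
```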
+ +## Step 2: Stack signature + +Within the chosen manifest, look for these dependency signatures: + +### Web framework dependencies (route pattern) + +**TypeScript/JavaScript** — look in `dependencies` and `devDependencies`: +- `next`, `next.js` → Next.js +- `express` → Express +- `fastify` → Fastify +- `hono` → Hono +- `koa`, `koa-router` → Koa +- `@nestjs/core` → NestJS +- `remix`, `@remix-run/react` → Remix + +**Python** — look in `pyproject.toml` `[project.dependencies]` or `requirements.txt`: +- `fastapi` → FastAPI +- `flask` → Flask +- `django` → Django +- `aiohttp` → aiohttp +- `starlette` → Starlette + +### Agent framework dependencies (agent tool pattern) + +**TypeScript/JavaScript:** +- `langchain`, `@langchain/*` → LangChain +- `@anthropic-ai/sdk` → Anthropic SDK +- `openai` → OpenAI SDK +- `mastra`, `@mastra/*` → Mastra +- `@ai-sdk/*`, `ai` → Vercel AI SDK +- `@modelcontextprotocol/sdk` → MCP SDK + +**Python:** +- `langchain`, `langchain-*` → LangChain +- `anthropic` → Anthropic SDK +- `openai` → OpenAI SDK +- `llama-index` → LlamaIndex +- `crewai` → CrewAI +- `mcp` → MCP SDK + +## Step 3: Pick the single pattern + +Apply this decision tree: + +| Signals | Pattern | Template family | +|---------|---------|-----------------| +| Web framework deps present | **API route** | `templates/route/` | +| Agent framework deps present, no web framework | **Agent tool** | `templates/tool/` | +| Both web and agent | Ask the user which surface they prefer (route or tool) | +| Library / CLI / unrecognized | **Module** | `templates/module/` | +| Greenfield (no manifest) | **Module**, language asked above | `templates/module/` | + +**Inform the user of the choice but don't put it up for vote** — the detection is the choice. + +## Step 4: Targeted scraping question + +Ask in **one** message: + +> What do you want to scrape, and which fields? (Examples: 'product name + price + image from amazon.com search results', 'all job titles + companies from greenhouse.io listings'.) Roughly how many pages or items? + +Do NOT ask separately about pagination type, output format, or language preference: +- **Pagination type** → discovered in Phase 2 reconnaissance. +- **Output format** → defaults to typed return value; the caller decides what to do with it. +- **Language** → already determined by manifest (TypeScript for `package.json`, Python for `pyproject.toml`/`requirements.txt`); greenfield is handled in Step 1. + +### Canonical framework slugs + +When emitting the decision record's `framework` field, use these exact slugs (so Phase 3's template lookup works): + +| Framework | Slug | +|-----------|------| +| Next.js (App Router) | `next-app-router` | +| Next.js (Pages Router) | `next-pages-router` | +| Express | `express` | +| Fastify | `fastify` | +| Hono | `hono` | +| Koa | `koa` | +| NestJS | `nestjs` | +| Remix | `remix` | +| FastAPI | `fastapi` | +| Flask | `flask` | +| Django | `django` | +| aiohttp | `aiohttp` | +| Starlette | `starlette` | +| LangChain (TS) | `langchain-ts` | +| LangChain (Python) | `langchain-py` | +| Anthropic SDK (TS) | `anthropic-sdk-ts` | +| Anthropic SDK (Python) | `anthropic-sdk-py` | +| OpenAI SDK (TS) | `openai-ts` | +| OpenAI SDK (Python) | `openai-py` | +| Mastra | `mastra` | +| Vercel AI SDK | `vercel-ai-sdk` | +| LlamaIndex | `llama-index` | +| CrewAI | `crewai` | + +If the detected framework isn't in this list, use the bare lowercase package name and tell the user a fallback template will be used. 
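Pulling Steps 2-4 together, here is a worked example of the mapping. The `package.json` fragment is hypothetical and the version numbers are illustrative only:

```json
{
  "dependencies": {
    "express": "^4.19.2",
    "dotenv": "^16.4.5"
  }
}
```

A web framework dependency (`express`) is present and no agent framework is, so the pattern is **API route**, the language is `typescript`, and the decision record's `framework` field is the slug `express`; `dotenv` plays no part in detection.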
+ +## Step 5: Present plan and wait for confirmation + +Format: + +``` +## Plan + +**Detected:** <language + framework> +**Pattern:** <module | route | tool> +**Scraping target:** <target URL + fields> +**Estimated volume:** <single page | small (<50) | bulk (>=50)> + +**Phases I'll run:** + 1. ✓ Detect & plan (this phase) + 2. Scraping playbook — pick the right Bright Data API and selectors + 3. Integrate — generate the <module | route | tool> in your project + 4. MCP & verify — wire Bright Data MCP, run a smoke test + +Ready to proceed to Phase 2? +``` + +**WAIT for the user to confirm before returning control to the orchestrator.** + +## Output of this phase + +The orchestrator carries forward this decision record: + +``` +{ + "language": "typescript" | "python" | "other", + "language_other_name": "<language name>" | null, + "pattern": "module" | "route" | "tool", + "framework": "<canonical slug>", + "target_url": "<url>", + "fields": ["<field>", "<field>", ...], + "volume": "single" | "small" | "bulk", + "monorepo_subproject": "<path>" | null +} +``` + +Subsequent phases reference this record by name. If anything is missing from it, return to Phase 1. diff --git a/brightdata-scrape/steering/phase2-scraping-playbook.md b/brightdata-scrape/steering/phase2-scraping-playbook.md new file mode 100644 index 0000000..b4d8228 --- /dev/null +++ b/brightdata-scrape/steering/phase2-scraping-playbook.md @@ -0,0 +1,85 @@ +# Phase 2: Scraping Playbook (Condensed) + +Pick the right Bright Data API and selectors for the user's target. Output a decision record the next phase consumes. + +> This is a **condensed** version of the long-form `scraper-builder` skill. For deeper analysis (anti-bot escalation, multi-site parallelism, retry semantics, browser-session reuse), the brightdata-plugin `scraper-builder` skill is the reference. This phase is a working subset. + +## Step 1: Decision tree (the four-line core) + +1. **Pre-built scraper exists for this domain?** → use **Web Scraper API** (zero parsing, structured JSON). +2. **Static HTML, no interaction needed?** → use **Web Unlocker** (cheapest, simplest). +3. **JS-rendered, or needs clicks/scrolls/form fills?** → use **Browser API** (full automation). +4. **Search engine results page (Google/Bing/Yandex)?** → use **SERP API**. + +## Step 2: Pre-built scraper check + +One curl to discover all available pre-built scrapers: + +```bash +curl -H "Authorization: Bearer $BRIGHTDATA_API_KEY" \ + https://api.brightdata.com/datasets/list +``` + +Search the response for the target domain. If you find a match, record the `dataset_id` and **skip to Step 5** — no reconnaissance needed. + +Common matches: `amazon`, `linkedin`, `instagram`, `tiktok`, `youtube`, `facebook`, `twitter` (X), `reddit`, `walmart`, `ebay`, `crunchbase`, `zillow`, `booking`, `yahoo-finance`, `google-play`, `apple-app-store`. + +## Step 3: Reconnaissance (no pre-built scraper) + +### Step 3a: Fetch raw HTML + +```bash +curl -X POST https://api.brightdata.com/request \ + -H "Authorization: Bearer $BRIGHTDATA_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"zone": "'"${BRIGHTDATA_UNLOCKER_ZONE:-unblocker}"'", "url": "<target URL>", "format": "raw"}' +``` + +### Step 3b: Inspect + +- **Is the data in the HTML?** If yes → Web Unlocker is sufficient. +- **Is the HTML mostly an empty shell** (`
<div id="root">`, `<div id="app">`, `ng-app`)? → content is client-rendered → **escalate to Browser API**. - **Does the page reference internal JSON endpoints** (look for `fetch('/api/...')`, `XMLHttpRequest`, hardcoded JSON in `<script>` tags)?