7 changes: 7 additions & 0 deletions README.md
@@ -56,6 +56,13 @@ Documentation is available at https://kiro.dev/docs/powers/

---

### brightdata-scrape
**Add web scraping to any app with Bright Data** - Detects your project's stack (Python, TypeScript, 9 web frameworks, 8 agent frameworks) and adds production-ready Bright Data web scraping in the right shape — a reusable module, an API route, or an agent tool — backed by Web Unlocker, SERP API, Web Scraper API, and Browser API. Wires the Bright Data MCP server into the project so any AI agent that runs against the project (Claude Code, Cursor, Cline, Kiro itself) gains live web tools.

**MCP Servers:** brightdata

---

### cloud-architect
**Build infrastructure on AWS** - Build AWS infrastructure with CDK in Python following AWS Well-Architected framework best practices.

85 changes: 85 additions & 0 deletions brightdata-scrape/POWER.md
@@ -0,0 +1,85 @@
---
name: "brightdata-scrape"
displayName: "Add web scraping to any app with Bright Data"
description: "Detect your project's stack and add production-ready web scraping — generates the right integration pattern (module, API route, or agent tool), wires Bright Data MCP into your project, handles pagination and bot detection."
keywords: ["scrape", "scraping", "scraper", "crawl", "crawler", "web-data", "extract", "extract-data", "competitor", "pricing-monitor", "lead-generation", "amazon", "linkedin", "instagram", "tiktok", "youtube", "serp", "google-search", "search-engine", "brightdata", "bright-data", "web-unlocker", "browser-api", "captcha", "bot-detection", "pagination", "agent-tools", "mcp"]
author: "Bright Data"
---

# Add web scraping to any app with Bright Data

This power detects what kind of project you're working on and adds production-ready scraping in the right shape — a reusable module, an API route, or an agent tool — backed by Bright Data's Web Unlocker, Web Scraper API, Browser API, and SERP API. It also wires the Bright Data MCP server into your project so any AI agent that runs against the project (Claude Code, Cursor, Cline, Kiro itself) gains live web tools.

**It works with any language**, but Python and TypeScript/JavaScript get first-class code generation. Other languages get a generic `curl`/HTTP template that you adapt.

## What you can do

- "Scrape competitor prices from example.com daily into my Next.js dashboard"
- "Add a `/api/scrape` route to my Express app"
- "Give my Claude SDK agent a tool that searches Google and reads results"
- "Extract all product listings from this Shopify store"
- "Monitor LinkedIn profiles of my pipeline contacts"

## Onboarding

Before using this power, complete the following.

### Step 1: Get a Bright Data API token

Sign up (or log in) at [brightdata.com](https://brightdata.com), then generate an API token on the [API token settings page](https://brightdata.com/cp/setting/users). The free tier includes **5,000 requests per month**, Pro tools included.

### Step 2: Configure the token (pick one)

**Option A — Env var (recommended for CI / production):**

```bash
export BRIGHTDATA_API_KEY=<your-token>
```

Add to your shell profile (`.zshrc`, `.bashrc`) or a project `.env` file. The generated `mcp.json` references `${BRIGHTDATA_API_KEY}`, so this works automatically.

**Option B — Hardcoded in user-level Kiro config:**

Edit `~/.kiro/settings/mcp.json` and add:

```json
{
"mcpServers": {
"brightdata": {
"url": "https://mcp.brightdata.com/mcp?token=YOUR_TOKEN_HERE",
"disabled": false
}
}
}
```

This makes Bright Data MCP available in every Kiro project on your machine.

### Step 3 (optional): Set up an Unlocker zone

The default Web Unlocker zone is named `unblocker` on new accounts. If you've renamed it or hit "no zone" errors, set:

```bash
export BRIGHTDATA_UNLOCKER_ZONE=<your-zone-name>
```

Or create or rename a zone on the [zone management page](https://brightdata.com/cp/zones).

---

## How to use this power

For any scraping task, **always** read the orchestrator steering file first:

```
Call action "readSteering" with powerName="brightdata-scrape", steeringFile="scrape-workflow.md"
```

The orchestrator runs four phases in sequence with confirmation gates between each:

1. **Detect & plan** — inspect the project, pick the right integration pattern, ask what to scrape.
2. **Scraping playbook** — pick the right Bright Data API and selectors based on the target site.
3. **Integrate** — generate the scraper module, API route, or agent tool into the user's project.
4. **MCP & verify** — wire the Bright Data MCP server, run a smoke test, write a README snippet.

**Do NOT improvise. Do NOT skip phases.** The steering files contain the exact instructions for each.
8 changes: 8 additions & 0 deletions brightdata-scrape/mcp.json
@@ -0,0 +1,8 @@
{
"mcpServers": {
"brightdata": {
"type": "http",
"url": "https://mcp.brightdata.com/mcp?token=${BRIGHTDATA_API_KEY}"
}
}
}
160 changes: 160 additions & 0 deletions brightdata-scrape/steering/phase1-detect-and-plan.md
@@ -0,0 +1,160 @@
# Phase 1: Detect & Plan

Inspect the user's workspace, classify it, and propose a single integration pattern. Confirm with the user before proceeding to Phase 2.

## Step 1: Manifest detection

Look for these manifest files at the workspace root, in this order (a detection sketch follows the notes below):

| Manifest | Language |
|----------|----------|
| `package.json` | TypeScript / JavaScript |
| `pyproject.toml`, `requirements.txt`, `Pipfile` | Python |
| `go.mod` | Go |
| `Cargo.toml` | Rust |
| `Gemfile` | Ruby |
| `pom.xml`, `build.gradle` | Java / Kotlin |
| `composer.json` | PHP |

**No manifest found** → **greenfield mode**. Skip to Step 4 and ask the user: "Python or TypeScript?"

**Multiple manifests in subdirectories (monorepo)** → ask the user which sub-project to integrate into. Do **not** pick automatically.
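
A minimal sketch of this check, using only the standard library (`detect_language` is a hypothetical helper name; monorepo handling is left to the caller):

```python
from pathlib import Path

# Manifest file -> language, in the same order as the table above.
MANIFESTS: dict[str, str] = {
    "package.json": "typescript",
    "pyproject.toml": "python",
    "requirements.txt": "python",
    "Pipfile": "python",
    "go.mod": "go",
    "Cargo.toml": "rust",
    "Gemfile": "ruby",
    "pom.xml": "java",
    "build.gradle": "java",
    "composer.json": "php",
}

def detect_language(root: Path) -> str | None:
    """Return the language of the first manifest found at the workspace
    root, or None for greenfield mode (ask "Python or TypeScript?")."""
    for manifest, language in MANIFESTS.items():
        if (root / manifest).exists():
            return language
    return None
```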

## Step 2: Stack signature

Within the chosen manifest, look for these dependency signatures:

### Web framework dependencies (route pattern)

**TypeScript/JavaScript** — look in `dependencies` and `devDependencies`:
- `next`, `next.js` → Next.js
- `express` → Express
- `fastify` → Fastify
- `hono` → Hono
- `koa`, `koa-router` → Koa
- `@nestjs/core` → NestJS
- `remix`, `@remix-run/react` → Remix

**Python** — look in `pyproject.toml` `[project.dependencies]` or `requirements.txt`:
- `fastapi` → FastAPI
- `flask` → Flask
- `django` → Django
- `aiohttp` → aiohttp
- `starlette` → Starlette

### Agent framework dependencies (agent tool pattern)

**TypeScript/JavaScript:**
- `langchain`, `@langchain/*` → LangChain
- `@anthropic-ai/sdk` → Anthropic SDK
- `openai` → OpenAI SDK
- `mastra`, `@mastra/*` → Mastra
- `@ai-sdk/*`, `ai` → Vercel AI SDK
- `@modelcontextprotocol/sdk` → MCP SDK

**Python:**
- `langchain`, `langchain-*` → LangChain
- `anthropic` → Anthropic SDK
- `openai` → OpenAI SDK
- `llama-index` → LlamaIndex
- `crewai` → CrewAI
- `mcp` → MCP SDK
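
A sketch of the matching logic for the TypeScript web-framework case, assuming plain `package.json` parsing (`detect_ts_web_framework` is a hypothetical helper; the agent lists and the Python manifests follow the same shape, and the canonical slugs are defined in Step 4):

```python
import json
from pathlib import Path

# Dependency name -> canonical framework slug.
TS_WEB_SIGNATURES: dict[str, str] = {
    "next": "next-app-router",   # refine to next-pages-router if pages/ exists
    "express": "express",
    "fastify": "fastify",
    "hono": "hono",
    "koa": "koa",
    "@nestjs/core": "nestjs",
    "@remix-run/react": "remix",
}

def detect_ts_web_framework(root: Path) -> str | None:
    """Return the slug of the first web-framework dependency found in
    package.json (dependencies and devDependencies), or None."""
    pkg = json.loads((root / "package.json").read_text())
    deps = {**pkg.get("dependencies", {}), **pkg.get("devDependencies", {})}
    for dependency, slug in TS_WEB_SIGNATURES.items():
        if dependency in deps:
            return slug
    return None
```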

## Step 3: Pick the single pattern

Apply this decision tree:

| Signals | Pattern | Template family |
|---------|---------|-----------------|
| Web framework deps present | **API route** | `templates/route/` |
| Agent framework deps present, no web framework | **Agent tool** | `templates/tool/` |
| Both web and agent frameworks present | Ask the user which surface they prefer | `templates/route/` or `templates/tool/`, per the answer |
| Library / CLI / unrecognized | **Module** | `templates/module/` |
| Greenfield (no manifest) | **Module**, language asked above | `templates/module/` |

**Inform the user of the choice but don't put it up for vote** — the detection is the choice.
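
The same table as a sketch (hypothetical helper; `"ask-user"` is a sentinel for the one case that needs a question):

```python
def pick_pattern(web_framework: str | None, agent_framework: str | None) -> str:
    """Apply the decision table: route beats tool beats module."""
    if web_framework and agent_framework:
        return "ask-user"  # the only ambiguous case: user picks route or tool
    if web_framework:
        return "route"
    if agent_framework:
        return "tool"
    return "module"        # library / CLI / unrecognized / greenfield
```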

## Step 4: Targeted scraping question

Ask in **one** message:

> What do you want to scrape, and which fields? (Examples: 'product name + price + image from amazon.com search results', 'all job titles + companies from greenhouse.io listings'.) Roughly how many pages or items?

Do NOT ask separately about pagination type, output format, or language preference:
- **Pagination type** → discovered in Phase 2 reconnaissance.
- **Output format** → defaults to typed return value; the caller decides what to do with it.
- **Language** → already determined by manifest (TypeScript for `package.json`, Python for `pyproject.toml`/`requirements.txt`); greenfield is handled in Step 1.

### Canonical framework slugs

When emitting the decision record's `framework` field, use these exact slugs (so Phase 3's template lookup works):

| Framework | Slug |
|-----------|------|
| Next.js (App Router) | `next-app-router` |
| Next.js (Pages Router) | `next-pages-router` |
| Express | `express` |
| Fastify | `fastify` |
| Hono | `hono` |
| Koa | `koa` |
| NestJS | `nestjs` |
| Remix | `remix` |
| FastAPI | `fastapi` |
| Flask | `flask` |
| Django | `django` |
| aiohttp | `aiohttp` |
| Starlette | `starlette` |
| LangChain (TS) | `langchain-ts` |
| LangChain (Python) | `langchain-py` |
| Anthropic SDK (TS) | `anthropic-sdk-ts` |
| Anthropic SDK (Python) | `anthropic-sdk-py` |
| OpenAI SDK (TS) | `openai-ts` |
| OpenAI SDK (Python) | `openai-py` |
| Mastra | `mastra` |
| Vercel AI SDK | `vercel-ai-sdk` |
| LlamaIndex | `llama-index` |
| CrewAI | `crewai` |

If the detected framework isn't in this list, use the bare lowercase package name and tell the user a fallback template will be used.

## Step 5: Present plan and wait for confirmation

Format:

```
## Plan

**Detected:** <stack signature, e.g., "Next.js 14 (App Router) project, TypeScript">
**Pattern:** <module | API route | agent tool>
**Scraping target:** <user's site/data>
**Estimated volume:** <single | small (<50) | bulk (>=50)>

**Phases I'll run:**
1. ✓ Detect & plan (this phase, just confirmed)
2. Scraping playbook — pick the right Bright Data API and selectors
3. Integrate — generate <files I'll write> in your project
4. MCP & verify — wire Bright Data MCP, run a smoke test

Ready to proceed to Phase 2?
```

**WAIT for the user to confirm before returning control to the orchestrator.**

## Output of this phase

The orchestrator carries forward this decision record:

```
{
"language": "typescript" | "python" | "other",
"language_other_name": "<e.g., 'go'>" | null,
"pattern": "module" | "route" | "tool",
"framework": "<e.g., 'next-app-router', 'fastapi', 'anthropic-sdk-ts'>",
"target_url": "<user's site>",
"fields": ["<field1>", "<field2>", ...],
"volume": "single" | "small" | "bulk",
"monorepo_subproject": "<path>" | null
}
```

Subsequent phases reference this record by name. If anything is missing from it, return to Phase 1.
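
For reference, the same record as a typed sketch (a `TypedDict` mirroring the keys above; the class name is illustrative):

```python
from typing import Literal, TypedDict

class DecisionRecord(TypedDict):
    language: Literal["typescript", "python", "other"]
    language_other_name: str | None  # e.g. "go"; None unless language is "other"
    pattern: Literal["module", "route", "tool"]
    framework: str                   # canonical slug from Step 4
    target_url: str
    fields: list[str]
    volume: Literal["single", "small", "bulk"]
    monorepo_subproject: str | None
```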
85 changes: 85 additions & 0 deletions brightdata-scrape/steering/phase2-scraping-playbook.md
@@ -0,0 +1,85 @@
# Phase 2: Scraping Playbook (Condensed)

Pick the right Bright Data API and selectors for the user's target. Output a decision record the next phase consumes.

> This is a **condensed** version of the long-form `scraper-builder` skill. For deeper analysis (anti-bot escalation, multi-site parallelism, retry semantics, browser-session reuse), the brightdata-plugin `scraper-builder` skill is the reference. This phase is a working subset.

## Step 1: Decision tree (the four-line core)

1. **Pre-built scraper exists for this domain?** → use **Web Scraper API** (zero parsing, structured JSON).
2. **Static HTML, no interaction needed?** → use **Web Unlocker** (cheapest, simplest).
3. **JS-rendered, or needs clicks/scrolls/form fills?** → use **Browser API** (full automation).
4. **Search engine results page (Google/Bing/Yandex)?** → use **SERP API**.

## Step 2: Pre-built scraper check

One curl to discover all available pre-built scrapers:

```bash
curl -H "Authorization: Bearer $BRIGHTDATA_API_KEY" \
https://api.brightdata.com/datasets/list
```

Search the response for the target domain. If you find a match, record the `dataset_id` and **skip to Step 5** — no reconnaissance needed.

Common matches: `amazon`, `linkedin`, `instagram`, `tiktok`, `youtube`, `facebook`, `twitter` (X), `reddit`, `walmart`, `ebay`, `crunchbase`, `zillow`, `booking`, `yahoo-finance`, `google-play`, `apple-app-store`.
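
A Python mirror of that check. The endpoint is the one above; the response schema (a JSON array of entries with `id` and `name` fields) is an assumption, so adapt the field access to the real payload:

```python
import os
import requests

def find_prebuilt_dataset(domain: str) -> str | None:
    """Look for a pre-built scraper whose name mentions the target domain.

    ASSUMPTION: entries carry `id` and `name` fields; adjust to what the
    endpoint actually returns.
    """
    resp = requests.get(
        "https://api.brightdata.com/datasets/list",
        headers={"Authorization": f"Bearer {os.environ['BRIGHTDATA_API_KEY']}"},
        timeout=30,
    )
    resp.raise_for_status()
    needle = domain.split(".")[0].lower()   # "amazon.com" -> "amazon"
    for entry in resp.json():
        if needle in str(entry.get("name", "")).lower():
            return entry.get("id")          # record this as dataset_id
    return None
```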

## Step 3: Reconnaissance (no pre-built scraper)

### Step 3a: Fetch raw HTML

```bash
curl -X POST https://api.brightdata.com/request \
-H "Authorization: Bearer $BRIGHTDATA_API_KEY" \
-H "Content-Type: application/json" \
-d '{"zone": "'"${BRIGHTDATA_UNLOCKER_ZONE:-unblocker}"'", "url": "<TARGET_URL>", "format": "raw"}'
```

### Step 3b: Inspect

- **Is the data in the HTML?** If yes → Web Unlocker is sufficient.
- **Is the HTML mostly an empty shell** (`<div id="root"></div>`, `<div id="__next"></div>`, `ng-app`)? → content is client-rendered → **escalate to Browser API**.
- **Does the page reference internal JSON endpoints** (look for `fetch('/api/...')`, `XMLHttpRequest`, hardcoded JSON in `<script>` tags)? → hitting the API endpoint directly via Web Unlocker is cleaner than HTML parsing.
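
A sketch combining 3a and 3b, built directly from the curl above (the marker list is illustrative, and the hidden-API check still needs a manual read of the HTML):

```python
import os
import requests

# Illustrative markers for a client-rendered shell; extend as needed.
EMPTY_SHELL_MARKERS = ['<div id="root"></div>', '<div id="__next"></div>', "ng-app"]

def fetch_and_classify(target_url: str) -> tuple[str, str]:
    """Fetch raw HTML via Web Unlocker (3a) and run the shell check (3b).

    Returns (html, approach), where approach is "browser-api" or "web-unlocker".
    """
    resp = requests.post(
        "https://api.brightdata.com/request",
        headers={
            "Authorization": f"Bearer {os.environ['BRIGHTDATA_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={
            "zone": os.environ.get("BRIGHTDATA_UNLOCKER_ZONE", "unblocker"),
            "url": target_url,
            "format": "raw",
        },
        timeout=60,
    )
    resp.raise_for_status()
    html = resp.text
    if any(marker in html for marker in EMPTY_SHELL_MARKERS):
        return html, "browser-api"   # empty shell: escalate
    return html, "web-unlocker"      # data is in the HTML
```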

### Step 3c: Pick selectors (priority order)

Prefer data attributes (`data-*`) over structural selectors — they survive redesigns. A fallback-chain sketch follows the list.

1. **`data-*` attributes** (e.g., `[data-testid="product-price"]`) — most stable.
2. **Semantic class names** (e.g., `.product-card .price`).
3. **`id` attributes** — unique, but may change.
4. **Structural selectors** (e.g., `div > span:nth-child(2)`) — fragile, **avoid**.
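
One way to encode the order: try selectors from most to least stable and stop at the first hit. A BeautifulSoup sketch with illustrative selectors:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Most stable first, per the priority order above.
PRICE_SELECTORS = [
    '[data-testid="product-price"]',  # 1. data-* attribute
    ".product-card .price",           # 2. semantic class
    "#price",                         # 3. id
]

def extract_price(html: str) -> str | None:
    """Try each selector in priority order; return the first hit."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None  # structural selectors are deliberately not a fallback
```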

## Step 4: Pagination patterns

| Pattern | Detection | Implementation |
|---------|-----------|----------------|
| **URL-based** | URLs like `?page=N`, `?offset=20` | Loop `?page=N` until response has no items |
| **Next-link** | HTML has `<a rel="next">` or `.pagination .next a` | Follow the next-link until absent |
| **Cursor** | API responses include `next_cursor`, `next_token` | Loop with `?cursor=<token>` from previous response |
| **Infinite scroll** | Content loads on scroll, no URL change | Browser API only — scroll until item count stops growing |
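
A sketch of the URL-based row (the next-link and cursor rows follow the same loop shape with a different advance condition; both callables are placeholders for code Phase 3 generates):

```python
from typing import Callable

def scrape_url_paginated(
    base_url: str,
    fetch_page: Callable[[str], str],          # e.g. a Web Unlocker fetch
    parse_items: Callable[[str], list[dict]],  # selector-based parsing
    max_pages: int = 100,                      # hard stop as a safety net
) -> list[dict]:
    """URL-based pagination: loop ?page=N until a page returns no items."""
    items: list[dict] = []
    for page in range(1, max_pages + 1):
        page_items = parse_items(fetch_page(f"{base_url}?page={page}"))
        if not page_items:
            break  # an empty page means we ran past the last one
        items.extend(page_items)
    return items
```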

## Step 5: Concurrency

If the user said "bulk" volume in Phase 1 (50+ URLs), the generated code **must** use semaphore-bounded async with `CONCURRENCY=20` as the default. Sequential `time.sleep(1)` loops at this volume are unacceptable.
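
A minimal sketch of the required shape (`fetch` stands in for whatever async fetcher Phase 3 generates):

```python
import asyncio
from typing import Awaitable, Callable

CONCURRENCY = 20  # default bound for bulk volume

async def scrape_bulk(
    urls: list[str],
    fetch: Callable[[str], Awaitable[dict]],  # the generated async fetcher
) -> list[dict]:
    """Fetch every URL concurrently with at most CONCURRENCY in flight."""
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def bounded(url: str) -> dict:
        async with semaphore:
            return await fetch(url)

    return list(await asyncio.gather(*(bounded(u) for u in urls)))
```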

## Step 6: Output the decision record

Hand back to the orchestrator:

```
{
"approach": "web-unlocker" | "browser-api" | "pre-built" | "serp",
"dataset_id": "<if pre-built>" | null,
"selectors": { "<field>": "<css selector>", ... },
"pagination": "url-based" | "next-link" | "cursor" | "infinite-scroll" | "none",
"concurrency_required": true | false,
"hidden_api_endpoint": "<URL or null>"
}
```

Confirm the choice with the user in one short message before returning to the orchestrator:

> I'll use **Web Unlocker** with these selectors: `<list>`. Pagination is **URL-based**. Sound right?

WAIT for confirmation. Then return.