A self-improving browser automation agent that learns from every success and failure, getting measurably faster and more accurate over time.
This isn't a scraper with a cache bolted on — the system literally cannot run without learning.
Built for WeaveHacks 3
https://github.com/user-attachments/assets/webscout-demo.mp4
https://github.com/nihalnihalani/WebScout/raw/main/demo-video/out/webscout-demo.mp4
40-second overview of WebScout's architecture and learning pipeline
Web scraping is fundamentally fragile. Sites change layouts overnight, cookie banners appear without warning, and A/B tests mean two users see completely different pages. Traditional scrapers break silently and require constant manual maintenance.
Every existing tool treats scraping as a stateless operation: fetch, parse, hope it works. When it breaks, a human fixes it. When it breaks again, the same human fixes the same thing.
WebScout takes a different approach. Every single task execution writes learning data back into the system. Successes reinforce pattern fitness. Failures increment failure counters and raise the confidence threshold. Recoveries store brand new patterns. The system cannot run without learning — it's not an optional feature, it's the architecture.
| Approach | Traditional Scrapers | WebScout |
|---|---|---|
| Failure handling | Break silently, wait for human | 4 adaptive recovery strategies, ordered by per-domain success rate |
| Pattern reuse | Manual selectors, hardcoded rules | 1536-dim vector embeddings with semantic KNN search — generalizes across similar pages |
| Confidence | Binary: works or doesn't | Wilson Score + time decay + dynamic threshold that adjusts itself |
| Observability | Logs at best | 7-level Weave integration with structured traces, retrospective feedback, formal evaluations |
| Learning | None | Every path writes back — successes, failures, and recoveries all teach the system |
| Staleness | Dead selectors accumulate | Patterns below fitness 0.05 with 3+ failures are auto-pruned |
| Improvement proof | "It seems better" | Cohort-based measurement with weighted scoring and letter grades |
- The Learning Loop
- Pattern Fitness Scoring
- Negative Learning
- Dynamic Confidence Threshold
- Adaptive Recovery Ordering
- Vector Similarity Search
- Weave — 7 Levels Deep
- Redis Architecture
- Real-Time SSE Streaming
- Quality Assessment
- Cohort-Based Improvement
- Gemini Pre-Analysis
- Browserbase + Stagehand
- Architecture
- Tech Stack Deep Dive
- Getting Started
- Project Structure
- API Reference
"Every single task execution writes learning data back into the system. Successes reinforce pattern fitness. Failures increment failure counters and raise the confidence threshold. Recoveries store brand new patterns. The system literally cannot run without learning."
The core learningScrape() function in scraper.ts runs a 5-stage pipeline on every task:
1. SEARCH          2. TRY CACHE         3. FRESH EXTRACT
┌───────────┐      ┌───────────┐        ┌───────────┐     ┌───────────┐
│  Vector   │─────>│  Cached   │──miss─>│  Gemini   │────>│  Direct   │
│    KNN    │      │  Pattern  │        │ PreAnalyze│     │  Extract  │
│  Search   │      │  (>85%)   │        │   (DOM)   │     │ Stagehand │
└───────────┘      └─────┬─────┘        └───────────┘     └─────┬─────┘
                         │ hit                                  │ fail
                         ▼                                      ▼
5. OBSERVE         4. LEARN                               3b. RECOVERY
┌───────────┐      ┌───────────┐                          ┌───────────┐
│   Weave   │<─────│   Store   │<─────────────────────────│ 4-Strategy│
│  7-Level  │      │  Vector   │                          │  Adaptive │
│   Trace   │      │ Embedding │                          │  Recovery │
└───────────┘      └───────────┘                          └───────────┘
| Stage | What Happens | Technical Detail | Code |
|---|---|---|---|
| 1. Vector Search | KNN query against Redis HNSW index | 1536-dim text-embedding-3-small vectors, cosine distance, top-10 candidates, query = {url_pattern} {target} | vectors.ts |
| 2. Cache Match | Composite ranking + threshold check | composite = similarity × 0.6 + fitness × 0.4. Must exceed dynamic threshold (0.70-0.95) stored in Redis. | scraper.ts |
| 3. Gemini Pre-Analysis | DOM snapshot sent to Gemini before extraction | Gemini 2.0-flash analyzes page structure, suggests CSS selectors and extraction strategy with reasoning | gemini.ts |
| 3a. Fresh Extraction | Stagehand AI extracts from live page | Browserbase cloud browser + Stagehand extract() with Zod schemas for structured output | stagehand-client.ts |
| 3b. Recovery | 4 strategies in adaptive order | Per-domain stats determine order. Agent → Act → Extract Refined → Gemini. Each outcome recorded. | recovery.ts |
| 4. Learn | Write back to the system | Store pattern as vector embedding, update fitness, record strategy outcome, adjust threshold | pattern-extractor.ts |
| 5. Observe | Full Weave tracing | Custom summarize() returns 8 structured metrics. 4 retrospective feedback scores attached via trace server API. | weave.ts |
Every path writes back. Cache hit? Update success_count and last_succeeded_at. Cache miss? Fresh extraction stores a new pattern. Cache failure? Increment failure_count, raise threshold by 0.02. Recovery succeeds? Store the recovery pattern. Recovery fails? Record the failure for that strategy on that domain. The system cannot execute a task without learning from it.
"We use the Wilson Score Lower Bound — the same algorithm Reddit uses to rank comments — to get a statistically reliable success estimate even for patterns with few data points. Then we multiply by exponential time decay with a 30-day half-life, because a pattern that worked 6 months ago on a site that's probably changed isn't worth much."
Implemented in pattern-fitness.ts — a pure synchronous function called in the ranking pipeline:
Step 1: Wilson Score Lower Bound (95% confidence)
p = success_count / total
z = 1.96 (95% CI)
denominator = 1 + z²/n
center = p + z²/2n
spread = z × √(p(1-p)/n + z²/4n²)
wilsonLower = (center - spread) / denominator
Why Wilson and not raw success rate? A pattern with 3/3 successes (100%) ranks lower than one with 50/52 (96%) because we have far less confidence in the small sample. Wilson penalizes uncertainty — this prevents new patterns from outranking battle-tested ones.
Step 2: Exponential Time Decay (30-day half-life)
daysSinceActivity = (now - lastActivity) / (1000 × 60 × 60 × 24)
decay = 0.5 ^ (daysSinceActivity / 30)
A 90-day-old pattern retains only 12.5% of its score. Sites change — patterns must prove themselves continuously.
Step 3: Recency Bonus
+0.15 if last success < 24 hours ago
+0.075 if last success < 7 days ago
Step 4: Final Composite
fitness = clamp(0, 1, wilsonLower × decay + recencyBonus)
composite = vectorSimilarity × 0.6 + fitness × 0.4
Patterns compete on both semantic relevance and proven reliability.
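The four steps above can be sketched as a single pure function. This is a minimal illustration, not the code in pattern-fitness.ts: the names PatternStats, wilsonLowerBound and computeFitness are hypothetical, and treating lastActivity as the later of the two timestamps is an assumption.

```typescript
// Hypothetical field names; timestamps are milliseconds since epoch, 0 = never.
interface PatternStats {
  successCount: number;
  failureCount: number;
  lastSucceededAt: number;
  lastFailedAt: number;
}

const Z = 1.96; // z-score for a 95% confidence interval
const HALF_LIFE_DAYS = 30;
const DAY_MS = 1000 * 60 * 60 * 24;

// Step 1: lower bound of the Wilson score interval.
function wilsonLowerBound(successes: number, failures: number): number {
  const n = successes + failures;
  if (n === 0) return 0; // assumption: no observations means no confidence
  const p = successes / n;
  const z2 = Z * Z;
  const denominator = 1 + z2 / n;
  const center = p + z2 / (2 * n);
  const spread = Z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return (center - spread) / denominator;
}

// Steps 2-4: time decay, recency bonus, clamp to [0, 1].
function computeFitness(stats: PatternStats, now: number): number {
  // Assumption: "last activity" is the most recent success or failure.
  const lastActivity = Math.max(stats.lastSucceededAt, stats.lastFailedAt);
  const days = Math.max(0, (now - lastActivity) / DAY_MS);
  const decay = Math.pow(0.5, days / HALF_LIFE_DAYS);

  let recencyBonus = 0;
  if (stats.lastSucceededAt > 0) {
    const sinceSuccess = now - stats.lastSucceededAt;
    if (sinceSuccess < DAY_MS) recencyBonus = 0.15;
    else if (sinceSuccess < 7 * DAY_MS) recencyBonus = 0.075;
  }

  const raw = wilsonLowerBound(stats.successCount, stats.failureCount) * decay + recencyBonus;
  return Math.min(1, Math.max(0, raw));
}
```

With these formulas, wilsonLowerBound(3, 0) is about 0.44 and wilsonLowerBound(50, 2) is about 0.87, which is why the 3/3 pattern ranks below the 50/52 one.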
"Most caching systems only remember successes. Ours tracks failures too. If a pattern starts failing — maybe the site redesigned — its fitness score drops. Below 0.2 it gets filtered from results. Below 0.05 with 3+ failures it gets auto-pruned. The system actively forgets what doesn't work anymore."
Implemented across vectors.ts, scraper.ts, pattern-pruner.ts:
Every pattern in Redis carries:
{
  success_count: number,
  failure_count: number,
  last_succeeded_at: number, // timestamp
  last_failed_at: number, // timestamp
}

| Event | System Response | Function |
|---|---|---|
| Cached pattern succeeds | success_count++, update last_succeeded_at | updatePatternLastSuccess() |
| Cached pattern fails | failure_count++, update last_failed_at | incrementPatternFailure() |
| Fitness < 0.2 | Pattern filtered from KNN results (exists but never used) | Ranking filter in scraper.ts |
| Fitness < 0.05 AND failure_count ≥ 3 | Pattern auto-deleted from Redis entirely | pruneDeadPatterns() |
This is how the system handles site redesigns gracefully. No human intervention needed — dead patterns decay, get filtered, and eventually get pruned.
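The two cut-offs in the table can be expressed as one small decision function. This is a sketch under the stated thresholds only; classifyPattern and PatternFate are illustrative names, not functions from the repository.

```typescript
// Hypothetical helper: decides what happens to a pattern given its fitness
// and failure count, using the thresholds quoted in the table above.
type PatternFate = "usable" | "filtered" | "pruned";

const FILTER_FITNESS = 0.2; // below this, the pattern is skipped during ranking
const PRUNE_FITNESS = 0.05; // below this (with enough failures), it is deleted
const PRUNE_MIN_FAILURES = 3;

function classifyPattern(fitness: number, failureCount: number): PatternFate {
  if (fitness < PRUNE_FITNESS && failureCount >= PRUNE_MIN_FAILURES) {
    return "pruned";
  }
  if (fitness < FILTER_FITNESS) {
    return "filtered";
  }
  return "usable";
}
```

Note that a low-fitness pattern with fewer than 3 failures is only filtered, not pruned, so a pattern that has simply gone stale is kept.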
"The confidence threshold for using cached patterns isn't static. It adjusts itself. Every successful cache hit lowers the threshold by half a percent — the system gets bolder. Every cache failure raises it by 2 percent — four times the penalty. It's asymmetric because a false positive is more expensive than a false negative."
Implemented in pattern-extractor.ts:
Redis key: webscout:confidence_threshold
Range: [0.70, 0.95]
Default: 0.85
Cache success: threshold -= 0.005 (bolder)
Cache failure: threshold += 0.02 (cautious)
Penalty ratio: 4:1 (failures penalize 4× more than successes reward)
Why asymmetric? A false positive (replaying a bad cached pattern) wastes a browser session, burns API credits, and delays the task. A false negative (doing a fresh extraction when cache would have worked) costs more compute but still succeeds. The 4:1 ratio reflects this cost asymmetry.
The threshold persists in Redis across deployments and restarts.
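The update rule is small enough to sketch directly. The function below is a pure illustration of the arithmetic; nextThreshold is a hypothetical name, the Redis read and write around it are omitted, and rounding to three decimals is an added assumption to keep floating-point drift out of the stored value.

```typescript
const MIN_THRESHOLD = 0.7;
const MAX_THRESHOLD = 0.95;
const DEFAULT_THRESHOLD = 0.85;
const SUCCESS_STEP = -0.005; // cache hit worked: get bolder
const FAILURE_STEP = 0.02; // cached pattern failed: get cautious

// Returns the new threshold after one cache outcome, clamped to [0.70, 0.95].
function nextThreshold(current: number, outcome: "success" | "failure"): number {
  const delta = outcome === "success" ? SUCCESS_STEP : FAILURE_STEP;
  const clamped = Math.min(MAX_THRESHOLD, Math.max(MIN_THRESHOLD, current + delta));
  return Math.round(clamped * 1000) / 1000; // assumption: store 3 decimals
}

// One failure followed by four successes returns to the starting value.
let threshold = nextThreshold(DEFAULT_THRESHOLD, "failure");
for (let i = 0; i < 4; i++) {
  threshold = nextThreshold(threshold, "success");
}
```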
"Recovery isn't random. We track per-domain success rates for each of our four recovery strategies in Redis. If agent-mode recovery works 80% of the time on Amazon but only 20% on Wikipedia, the system learns that and reorders accordingly. It's a multi-armed bandit — we try the best-known strategy first, track duration, and adapt over time."
Implemented in strategy-selector.ts and recovery.ts:
| # | Strategy | What It Does | Powered By | Best For |
|---|---|---|---|---|
| 1 | Agent | Autonomous GPT-4o agent that reasons about the page, navigates complex flows, handles multi-step interactions | Stagehand agent.execute() | Complex SPAs, multi-step flows |
| 2 | Act | Blocker removal: dismiss cookie banners, close modals, scroll past overlays, then re-extract | Stagehand act() | Cookie walls, GDPR popups, paywalls |
| 3 | Extract Refined | Re-extraction with enriched instructions targeting main content, sidebars, tables, metadata | Stagehand extract() + hints | Pages where initial selectors were too broad |
| 4 | Gemini | Full DOM snapshot analyzed by Gemini, which suggests alternative CSS selectors for a different approach | Gemini 2.0-flash | Unusual page structures, non-standard layouts |
Redis key: strategy_stats:{urlPattern}:{strategy}
Fields: { attempts: number, successes: number, avg_duration_ms: number }
- recordStrategyOutcome() updates stats with running average duration after each attempt
- getOrderedStrategies() sorts by success rate descending, breaking ties by avg duration ascending
- The system learns that different sites respond to different strategies
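A rough sketch of the two operations described above, written as pure functions over the stats fields. The names recordOutcome and orderStrategies are stand-ins rather than the repository's recordStrategyOutcome() and getOrderedStrategies(), the strategy identifiers are illustrative, and giving untried strategies a success rate of 0 is an assumption.

```typescript
type Strategy = "agent" | "act" | "extract_refined" | "gemini";

interface StrategyStats {
  attempts: number;
  successes: number;
  avg_duration_ms: number;
}

// Incremental running average: no need to keep every past duration.
function recordOutcome(stats: StrategyStats, success: boolean, durationMs: number): StrategyStats {
  const attempts = stats.attempts + 1;
  return {
    attempts,
    successes: stats.successes + (success ? 1 : 0),
    avg_duration_ms: stats.avg_duration_ms + (durationMs - stats.avg_duration_ms) / attempts,
  };
}

// Highest success rate first; ties go to the faster strategy.
function orderStrategies(all: Record<Strategy, StrategyStats>): Strategy[] {
  const rate = (s: StrategyStats): number => (s.attempts === 0 ? 0 : s.successes / s.attempts);
  return (Object.keys(all) as Strategy[]).sort((a, b) => {
    const diff = rate(all[b]) - rate(all[a]);
    return diff !== 0 ? diff : all[a].avg_duration_ms - all[b].avg_duration_ms;
  });
}
```

A pure greedy ordering like this never revisits a strategy that started badly; a production bandit would usually add some exploration, which this sketch leaves out.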
"Pattern matching isn't keyword-based — it's semantic. We embed every URL pattern and target description into a 1536-dimensional vector using OpenAI embeddings, store them in a Redis HNSW index, and do KNN cosine similarity search. So if you taught WebScout to extract prices from amazon.com/products, and a new task comes in for amazon.com/electronics, the vector search finds that pattern because the semantic meaning is similar, not because the strings match."
Implemented in vectors.ts:
Index: idx:page_patterns
Algorithm: HNSW (Hierarchical Navigable Small World)
Type: FLOAT32
Dimensions: 1536
Distance: COSINE
Embeddings: OpenAI text-embedding-3-small
TEXT fields: url_pattern, target, working_selector
TAG fields: approach
NUMERIC fields: created_at, success_count, failure_count, last_succeeded_at, last_failed_at
VECTOR field: embedding (1536-dim HNSW)
- Build query text: {url_pattern} {target}
- Embed via text-embedding-3-small → 1536-dim vector
- KNN search: *=>[KNN 10 @embedding $BLOB AS vector_score]
- Normalize: similarity = 1 - vectorScore / 2
- Re-rank by composite: similarity × 0.6 + fitness × 0.4
Keyword matching fails across URL variations. Vector search generalizes:
- amazon.com/dp/B08N5 → matches amazon.com/products/electronics (product pages)
- news.ycombinator.com → matches lobste.rs (link aggregators)
- github.com/user/repo → matches gitlab.com/user/project (code hosting)
The learning transfers across semantically similar pages, not just exact URL matches.
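The post-search steps (normalize, filter, re-rank, threshold check) can be sketched without touching Redis. The KnnHit shape and the function names below are assumptions for illustration; the embedding call and the FT.SEARCH query itself are deliberately left out.

```typescript
// Hypothetical shape of one KNN result after fitness has been computed.
interface KnnHit {
  id: string;
  vectorScore: number; // cosine distance from the index, in [0, 2]
  fitness: number; // pattern fitness, in [0, 1]
}

interface RankedHit extends KnnHit {
  similarity: number;
  composite: number;
}

// The text that gets embedded for both stored patterns and incoming tasks.
function buildQueryText(urlPattern: string, target: string): string {
  return `${urlPattern} ${target}`;
}

// Cosine distance 0 (identical) maps to 1, distance 2 (opposite) maps to 0.
function toSimilarity(vectorScore: number): number {
  return 1 - vectorScore / 2;
}

// Drop low-fitness patterns, then sort by the 60/40 composite score.
function rerank(hits: KnnHit[]): RankedHit[] {
  return hits
    .filter((h) => h.fitness >= 0.2)
    .map((h) => {
      const similarity = toSimilarity(h.vectorScore);
      return { ...h, similarity, composite: similarity * 0.6 + h.fitness * 0.4 };
    })
    .sort((a, b) => b.composite - a.composite);
}

// Returns the best candidate only if it exceeds the confidence threshold.
function pickCachedPattern(hits: KnnHit[], threshold: number): RankedHit | null {
  const best = rerank(hits)[0];
  return best !== undefined && best.composite > threshold ? best : null;
}
```

In this scheme a close vector match with poor fitness can lose to a slightly more distant match that has proven reliable.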
"This isn't weave.init() and done. The entire feedback loop flows through Weave."
Implemented in weave.ts, trace-context.ts, batch-eval.ts, weave-eval-logger.ts:
Every meaningful function is a weave.op() with a custom summarize() callback that returns structured metrics — not just "it ran":
| Operation | Metrics in Summary |
|---|---|
| learningScrape | success, used_cache, recovery_attempted, recovery_succeeded, pattern_learned, quality_score, duration_ms, steps_count |
| attemptRecovery | recovery_success, recovery_strategy |
| pruneDeadPatterns | patterns_pruned, patterns_remaining |
| searchSimilarPatterns | match count, top similarity scores |
| storePattern | pattern ID, selector, approach |
| adjustConfidenceThreshold | new value, direction, delta |
learningScrape is created via createInvocableOp(). The API route calls .invoke() which returns [result, Call]. The Call.id is essential for Level 3.
After every task completes, addScoreToCall() uses the raw Weave trace server API (traceServerApi.feedback.feedbackCreateFeedbackCreatePost()) to attach 4 feedback scores to the call:
| Score Key | Type | What It Captures |
|---|---|---|
| webscout.success | boolean → 0/1 | Did the extraction produce valid data? |
| webscout.quality | 0-100 + comment | GPT-4o quality assessment of extracted content |
| webscout.used_cache | boolean | Was a cached pattern used? (tracks cache adoption) |
| webscout.recovery_needed | boolean | Did recovery strategies fire? (tracks reliability) |
This is not the evaluation-only addScore() — it's the raw feedback endpoint that attaches scores to any call.
savePatternDataset() creates a versioned weave.Dataset named "webscout-learned-patterns". Each save creates a new version. You can track exactly how the pattern library evolves over time in the Weave UI — what patterns were added, which ones gained fitness, which ones decayed.
createWeaveImage(base64) converts browser screenshots to weave.weaveImage() format. Screenshots render inline in the Weave trace viewer — you can see exactly what the agent saw at each step of the pipeline without leaving the trace UI.
4 typed scorers, each a weave.op():
| Scorer | Logic |
|---|---|
| webscout.scorer.success | Binary: 1.0 if extraction succeeded, 0.0 if not |
| webscout.scorer.speed | Linear interpolation: ≤3s → 1.0, ≥60s → 0.0 |
| webscout.scorer.cache_efficiency | 1.0 if cached, 0.5 if fresh, 0.25 if recovery needed |
| webscout.scorer.quality | Normalized extraction quality from GPT-4o assessment |
A replayModel replays historical task results as a Weave model. weave.Evaluation runs all 4 scorers over a weave.Dataset built from task history — producing formal evaluation results in the Weave dashboard.
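The speed and cache-efficiency scorers are plain arithmetic, so their logic can be sketched without any Weave wiring. The function names and the TaskOutcome shape are hypothetical, and checking recovery before cache use when both flags are set is an assumption, since the table does not say which takes precedence.

```typescript
// Hypothetical subset of a task result, just enough for these two scorers.
interface TaskOutcome {
  usedCache: boolean;
  recoveryNeeded: boolean;
}

const FAST_MS = 3000; // at or below this, score is 1.0
const SLOW_MS = 60000; // at or above this, score is 0.0

// Linear interpolation between the fast and slow bounds.
function speedScore(durationMs: number): number {
  if (durationMs <= FAST_MS) return 1;
  if (durationMs >= SLOW_MS) return 0;
  return 1 - (durationMs - FAST_MS) / (SLOW_MS - FAST_MS);
}

// Assumption: a task that needed recovery scores 0.25 even if it began
// with a cached pattern.
function cacheEfficiencyScore(outcome: TaskOutcome): number {
  if (outcome.recoveryNeeded) return 0.25;
  return outcome.usedCache ? 1.0 : 0.5;
}
```

For example, a task that takes 31.5 seconds sits exactly halfway between the two bounds and scores 0.5 on speed.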
withWeaveAttributes() propagates taskId, urlPattern, target, and sessionType to all child spans. Every operation in a task inherits these attributes, enabling filtering in the Weave UI by any dimension — "show me all traces for amazon.com" or "show me all cache misses."
"Redis isn't just our cache — it's our entire learning state. All learning state is in Redis, which means the system picks up exactly where it left off across restarts and deployments."
Implemented across vectors.ts, tasks.ts, client.ts:
| Key Pattern | Redis Type | What It Stores |
|---|---|---|
| pattern:{uuid} | Hash | Complete pattern: 1536-dim embedding, working selector, approach, success_count, failure_count, last_succeeded_at, last_failed_at, created_at |
| idx:page_patterns | RediSearch Index | HNSW vector index over all pattern:* hashes. Config: FLOAT32, 1536 dims, COSINE distance. Enables sub-millisecond KNN search. |
| task:{id} | Hash | Full task data: URL, target, status, all execution steps with screenshots, pattern used, recovery attempts, quality score, timestamps |
| tasks:timeline | Sorted Set | Task IDs scored by creation timestamp. Enables chronological listing and cohort splitting for evaluation. |
| strategy_stats:{urlPattern}:{strategy} | Hash | Per-domain, per-strategy stats: attempts, successes, avg_duration_ms. Powers adaptive recovery ordering. |
| webscout:confidence_threshold | String | Single float value (0.70-0.95). The dynamic confidence threshold that adjusts on every cache hit/miss. |
Why Redis for everything? All learning state in one place means:
- System picks up exactly where it left off after restart
- No cold start — patterns, stats, and threshold persist across deployments
- Vector search + key-value storage + sorted sets in a single service
- Sub-millisecond reads for the hot path (pattern lookup)
"When you submit a task, the dashboard connects via Server-Sent Events. The server polls Redis every 500ms and streams step-by-step updates — you see each extraction attempt, each recovery strategy, each pattern learned — as it happens."
Server — api/tasks/[id]/stream/route.ts
- ReadableStream with TextEncoder for SSE protocol
- Polls Redis every 500ms for new steps via getTask()
- 120-second safety timeout prevents zombie connections
- Named events: done (task complete), error (server error)
Client — use-task-stream.ts
- Native EventSource API with automatic reconnect on transient errors
- Preserves last known state during disconnects, so the UI does not flicker
- Merges incremental step updates into the task state
Progress Flushing — tasks.ts
The scraper calls flushProgress() via updateTaskProgress() after every significant step — cache hit, extraction attempt, recovery start, pattern store. This writes intermediate state to Redis so the SSE stream picks up steps in near real-time, even while the scraper is still working on a different part of the pipeline.
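The server side of this flow comes down to two pieces: framing a payload in the SSE wire format and deciding, on each poll, which steps are new. The sketch below covers only that logic; the TaskSnapshot shape, the status values and the function names are assumptions, and the ReadableStream, TextEncoder and 500ms timer that would wrap it in the route handler are omitted.

```typescript
// Hypothetical view of a task as read back from Redis on each poll.
interface TaskSnapshot {
  status: "running" | "completed" | "failed";
  steps: string[];
}

// One SSE frame: optional "event:" line, one "data:" line, blank line to end.
function formatSseEvent(data: unknown, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  lines.push(`data: ${JSON.stringify(data)}`);
  return lines.join("\n") + "\n\n";
}

// One poll tick: emit only steps not sent yet, plus "done" when the task ends.
function framesForPoll(
  task: TaskSnapshot,
  alreadySent: number,
): { frames: string[]; sent: number; finished: boolean } {
  const frames: string[] = [];
  if (task.steps.length > alreadySent) {
    frames.push(formatSseEvent({ steps: task.steps.slice(alreadySent) }));
  }
  const finished = task.status !== "running";
  if (finished) {
    frames.push(formatSseEvent({ status: task.status }, "done"));
  }
  return { frames, sent: task.steps.length, finished };
}
```

Unnamed frames arrive at the client's onmessage handler, while the named done frame can be handled with addEventListener("done", ...) on the EventSource.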
"Every extraction gets a quality score from GPT-4o — not just pass/fail, but a 0-100 assessment of how complete and accurate the extracted data is. That score feeds into the evaluation pipeline and shows up as Weave feedback on the call."
Implemented in openai-quality.ts:
- After every successful extraction: assessExtractionQuality() calls GPT-4o with the extracted data, the target description, and the source URL
- Returns: 0-100 quality score + text summary explaining the assessment
- Score is attached to the Weave call as webscout.quality feedback (Level 3)
- Feeds into evaluation pipeline with 20% weight in overall improvement score
- Distinguishes between "it extracted something" and "it extracted the right thing well"
"We prove improvement quantitatively. We split task history into thirds and compute deltas across five metrics. After enough tasks, you can literally see the grade go from D to A as the system learns. This isn't a claim — it's measured."
Implemented in batch-eval.ts and /api/evaluation:
- Sort all tasks chronologically
- Split into three cohorts: early (first third), middle, late (last third)
- Compute 6 metrics per cohort: success_rate, avg_duration, cache_hit_rate, recovery_rate, avg_quality_score, task_count
- Compute deltas between early and late cohorts
- Calculate weighted improvement score
| Dimension | Weight | Rationale |
|---|---|---|
| Success Rate | 30% | The most important metric — does it work? |
| Speed | 25% | Cache hits should make it faster over time |
| Cache Efficiency | 25% | Higher cache hit rate = more learning |
| Quality | 20% | Not just pass/fail — quality should improve too |
| Grade | Score | Meaning |
|---|---|---|
| A | ≥ 70 | Strong measurable improvement across dimensions |
| B | ≥ 50 | Clear improvement trend |
| C | ≥ 30 | Some improvement, system is learning |
| D | < 30 | Early stage — not enough data yet |
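The cohort split, weighting and grading can be sketched as follows. How each dimension's early-to-late delta is turned into a 0-100 score is not specified here, so improvementScore takes those per-dimension scores as given inputs; that, and all function names, are assumptions.

```typescript
type Dimension = "success" | "speed" | "cache" | "quality";

// Weights from the table above.
const WEIGHTS: Record<Dimension, number> = {
  success: 0.3,
  speed: 0.25,
  cache: 0.25,
  quality: 0.2,
};

// Split a chronologically sorted list into thirds; leftovers go to the middle.
function splitCohorts<T>(tasks: T[]): { early: T[]; middle: T[]; late: T[] } {
  const third = Math.floor(tasks.length / 3);
  return {
    early: tasks.slice(0, third),
    middle: tasks.slice(third, tasks.length - third),
    late: tasks.slice(tasks.length - third),
  };
}

// Assumption: each dimension has already been scored on a 0-100 scale.
function improvementScore(scores: Record<Dimension, number>): number {
  return (Object.keys(WEIGHTS) as Dimension[]).reduce(
    (total, dim) => total + WEIGHTS[dim] * scores[dim],
    0,
  );
}

// Grade boundaries from the table above.
function letterGrade(score: number): "A" | "B" | "C" | "D" {
  if (score >= 70) return "A";
  if (score >= 50) return "B";
  if (score >= 30) return "C";
  return "D";
}
```

With fewer than three tasks the early and late cohorts are empty, which is consistent with grade D meaning "not enough data yet".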
| Metric | Early Tasks | Recent Tasks | Delta |
|---|---|---|---|
| Success Rate | 86% | 100% | +17% |
| Extraction Speed | 10.1s | 2.7s | 3.7x faster |
| Cache Hit Rate | 14% | 83% | +69 percentage points |
| Recovery Needed | 86% | 17% | -81% |
These numbers emerge naturally as the vector cache fills with successful patterns. No manual tuning.
"Before we extract, we optionally send a DOM snapshot to Google Gemini for pre-analysis. Gemini suggests which CSS selectors are likely to contain the target data. During recovery, Gemini analyzes why the previous attempt failed and suggests alternative selectors. It's a second AI opinion before we commit to an extraction strategy."
Implemented in gemini.ts — uses Gemini 2.0-flash for speed:
| Function | Phase | What It Does |
|---|---|---|
| geminiAnalyzePage() | Before fresh extraction | Sends DOM snapshot + target description. Returns: suggested CSS selectors, recommended extraction strategy (direct, scroll_first, click_expand), and reasoning. |
| getGeminiRecoveryStrategy() | During recovery | Receives the failed attempt context. Analyzes why it failed. Suggests alternative selectors and a different approach. |
| isGeminiAvailable() | Startup | Checks API key availability. Gemini is optional: the system works without it but benefits from the second opinion. |
Gemini suggestions are appended as hints to the Stagehand extraction instruction, improving first-attempt success rates on unfamiliar pages.
"We use Browserbase for cloud browser infrastructure — every task runs in an isolated browser session with a debuggable live URL. Stagehand provides three levels of browser interaction: extract for structured AI extraction, act for clicking and interacting, and agent for full autonomous reasoning. The recovery pipeline uses all three."
Browserbase — session.ts
- Every task spawns an isolated cloud browser session — no local Chrome installation needed
- Each session has a live debuggable URL stored with the task for real-time viewing and post-mortem debugging
- Sessions are managed and cleaned up automatically
- The /live dashboard page embeds the session viewer for real-time watching
Stagehand — stagehand-client.ts
Three levels of AI-powered browser interaction, each used in different parts of the pipeline:
| Method | Level | Usage in WebScout | Powered By |
|---|---|---|---|
| stagehand.extract() | Structured extraction | Primary extraction path. AI-powered data extraction with Zod schemas for typed, structured output. Used for both cached pattern replay and fresh extraction. | GPT-4o |
| stagehand.act() | Browser actions | Recovery strategy #2. Click consent buttons, dismiss cookie banners, close modals, scroll past overlays, accept GDPR popups, then re-extract. | GPT-4o |
| stagehand.agent.execute() | Autonomous reasoning | Recovery strategy #1. Full autonomous GPT-4o agent that reasons about the page, plans a multi-step approach, navigates complex flows. The most powerful but most expensive strategy. | GPT-4o |
| page.goto(), page.screenshot(), page.evaluate() | Standard Playwright | Navigation, screenshot capture for Weave traces, DOM evaluation for Gemini analysis | Playwright |
The recovery pipeline uses all three: agent for autonomous reasoning, act to remove blockers, and extract with refined hints. The order in which they run is set per domain by the strategy selector.
┌──────────────────────────────────────────────────────────────┐
│ WebScout Dashboard │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Live View│ │Evaluation│ │ Patterns │ │ Teaching │ │
│ │ (SSE) │ │ (Cohorts)│ │ (Browse) │ │ (Manual) │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Tasks │ │ Timeline │ │ Health │ │
│ │ (Manage) │ │ (Trace) │ │ (Status) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────┬─────────────────────────────────┘
│ Next.js API Routes (12 endpoints)
┌────────────────────────────▼─────────────────────────────────┐
│ Learning Engine │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ learningScrape() │ │
│ │ │ │
│ │ Vector Search → Cache Match → Gemini → Extract │ │
│ │ ↑ │ │ │ │
│ │ │ ↓ ↓ │ │
│ │ Pattern Store ← Learn ←──────────── Recovery │ │
│ │ ↑ ↑ │ │ │
│ │ │ Strategy Selector ────┘ │ │ │
│ │ │ │ │ │ │
│ │ Fitness Scoring Negative Learning Confidence Adj │ │
│ │ (Wilson+Decay) (Auto-Prune) (Asymmetric 4:1) │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────┬──────────────┬──────────────┬─────────────────────────┘
│ │ │
┌──────▼───┐ ┌─────▼──────┐ ┌───▼───────────┐
│ Redis │ │ Browserbase│ │ Weave │
│ Stack │ │ + Stagehand│ │ │
│ │ │ │ │ 7 Integration │
│ 6 Key │ │ 3 Interact │ │ Levels │
│ Patterns │ │ Modes │ │ │
│ │ │ │ │ Traces → │
│ • HNSW │ │ • extract │ │ Feedback → │
│ Vector │ │ • act │ │ Datasets → │
│ • Pattern │ │ • agent │ │ Images → │
│ Hashes │ │ │ │ Evaluations → │
│ • Task │ │ Live debug │ │ Context prop │
│ History │ │ URLs │ │ │
│ • Strategy│ └────────────┘ └───────────────┘
│ Stats │ │ │
│ • Confid. │ ┌─────▼──────┐ ┌───▼───────────┐
│ Thresh. │ │ OpenAI │ │ Vercel │
│ • Timeline│ │ │ │ │
└───────────┘ │ • GPT-4o │ │ • Next.js 15 │
│ (extract) │ │ • App Router │
┌─────│ • embed-3 │ │ • SSE streams │
│ │ (vectors) │ │ • Serverless │
│ │ • GPT-4o │ └───────────────┘
│ │ (quality) │
│ └────────────┘
│
┌─────▼──────┐
│ Gemini │
│ 2.0-flash │
│ │
│ • DOM pre- │
│ analysis │
│ • Recovery │
│ strategy │
└────────────┘
| Technology | Role | Technical Depth |
|---|---|---|
| Weave (W&B) | Observability + evaluation | 7 integration levels: weave.op() with custom summarize(), createInvocableOp() for Call ID capture, raw trace server API for retrospective feedback, versioned weave.Dataset for pattern evolution, weave.weaveImage() for inline screenshots, formal weave.Evaluation with 4 typed scorers, withWeaveAttributes() for context propagation |
| Redis Stack + RediSearch | Learning state + vector search | 6 key patterns: HNSW vector index (FLOAT32, 1536-dim, COSINE), pattern hashes with fitness fields, task hashes, sorted set timeline, per-domain strategy stats with running averages, dynamic confidence threshold. All learning state persists across deployments. Sub-ms reads on hot path. |
| Browserbase | Cloud browser infrastructure | Isolated cloud browser sessions per task. Live debuggable URLs stored with each task. No local Chrome needed. Session management with automatic cleanup. Embedded viewer in dashboard /live page. |
| Stagehand v3 | AI browser interaction | 3 interaction modes: extract() with Zod schemas for typed structured output, act() for browser actions (click, scroll, dismiss), agent.execute() for autonomous GPT-4o reasoning. All three chained in recovery pipeline. |
| OpenAI GPT-4o | Extraction + quality | Powers Stagehand extraction and autonomous agent. assessExtractionQuality() provides 0-100 quality scores with text summaries. Scores feed into evaluation pipeline at 20% weight. |
| OpenAI text-embedding-3-small | Semantic vectors | 1536-dimensional embeddings for URL patterns and target descriptions. Enables generalization across semantically similar pages (amazon.com/products → amazon.com/electronics). |
| Google Gemini 2.0-flash | Second AI opinion | DOM pre-analysis suggests CSS selectors before extraction. Recovery analysis when primary extraction fails. Provides structural understanding from a different model's perspective. Optional — system works without it. |
| Next.js 15 + React 19 | Full-stack framework | App Router with 12 serverless API routes. SSE streaming via ReadableStream. 8 dashboard pages. React 19 with hooks for real-time state. |
| Tailwind CSS v4 + shadcn/ui | Dashboard UI | Dark-themed responsive dashboard. Cards, tables, badges, tabs. Recharts for learning curve and improvement visualizations. |
| Vercel | Deployment | Edge-optimized Next.js deployment. Serverless API routes handle task submission, SSE streaming, evaluation, and pattern management. |
- Node.js 18+
- Redis Stack (with RediSearch module — required for vector search)
- API keys: Browserbase, OpenAI, Weights & Biases, Google AI (Gemini is optional)
# Clone the repository
git clone https://github.com/nihalnihalani/WebScout.git
cd WebScout/webscout
# Install dependencies
npm install
# Set up environment variables
cp .env.example .env.local
# Fill in your API keys (see below)
# Start Redis Stack (with RediSearch)
docker compose up -d
# Or install locally:
# macOS: brew install redis-stack && redis-stack-server &
# Linux: See https://redis.io/docs/install/install-stack/
# Run the development server
npm run dev
# Seed demo data (optional — populates example tasks and patterns)
curl -X POST http://localhost:3000/api/demo/seed

Open http://localhost:3000 to access the dashboard.
# Required
BROWSERBASE_API_KEY= # browserbase.com/settings
BROWSERBASE_PROJECT_ID= # Browserbase dashboard
OPENAI_API_KEY= # platform.openai.com/api-keys
REDIS_URL= # Default: redis://localhost:6379
WANDB_API_KEY= # wandb.ai/authorize
# Optional
WEAVE_PROJECT= # Default: webscout
GOOGLE_AI_API_KEY= # aistudio.google.com/apikey (Gemini, optional but recommended)

# Clear ALL data (tasks, patterns, strategy stats, confidence threshold)
curl -X POST http://localhost:3000/api/demo/reset
# Re-seed fresh demo data
curl -X POST http://localhost:3000/api/demo/seed

| Page | Route | What You See |
|---|---|---|
| Home | / | Task submission form. Enter URL + target description, watch it execute in real-time. |
| Dashboard | /dashboard | Stats overview, learning curve chart (Recharts), recent tasks with status indicators |
| Live View | /live | Embedded Browserbase session viewer: watch the agent navigate, click, and extract in real-time |
| Tasks | /tasks | Task list with status, duration, cache hit/miss badges, recovery indicators |
| Task Detail | /tasks/[id] | Full execution log: step-by-step trace with screenshots, timing, pattern used, quality score |
| Patterns | /patterns | Pattern library grid: browse all learned patterns with fitness scores, success/failure counts, selectors |
| Evaluation | /evaluation | Cohort comparison, improvement deltas, letter grade, weighted scoring breakdown |
| Teach | /teach | Manual teaching mode: teach extraction patterns for specific sites by providing URL + selector + approach |
webscout/
├── src/
│ ├── app/ # Next.js App Router
│ │ ├── page.tsx # Home — task submission
│ │ ├── dashboard/page.tsx # Stats + learning curve
│ │ ├── live/page.tsx # Real-time browser view
│ │ ├── tasks/page.tsx # Task list
│ │ ├── tasks/[id]/page.tsx # Task detail + execution log
│ │ ├── patterns/page.tsx # Pattern library
│ │ ├── evaluation/page.tsx # Improvement metrics
│ │ ├── teach/page.tsx # Manual teaching mode
│ │ └── api/
│ │ ├── tasks/route.ts # POST: submit task, GET: list
│ │ ├── tasks/[id]/route.ts # GET: task detail
│ │ ├── tasks/[id]/stream/ # SSE: real-time updates
│ │ ├── patterns/route.ts # GET: learned patterns
│ │ ├── evaluation/route.ts # GET: cohort metrics
│ │ ├── evaluation/batch/ # POST: formal Weave eval
│ │ ├── teach/route.ts # POST: manual teaching
│ │ ├── health/route.ts # GET: service health
│ │ ├── metrics/route.ts # GET: aggregate stats
│ │ ├── timeline/route.ts # GET: task timeline
│ │ └── demo/
│ │ ├── seed/route.ts # POST: seed demo data
│ │ └── reset/route.ts # POST: clear everything
│ │
│ ├── lib/
│ │ ├── engine/ # Core learning engine
│ │ │ ├── scraper.ts # THE learning loop — learningScrape()
│ │ │ ├── recovery.ts # 4-strategy recovery pipeline
│ │ │ ├── pattern-fitness.ts # Wilson Score + time decay
│ │ │ ├── pattern-extractor.ts # Pattern storage + confidence threshold
│ │ │ ├── pattern-pruner.ts # Auto-prune dead patterns
│ │ │ └── strategy-selector.ts # Adaptive recovery ordering
│ │ │
│ │ ├── redis/ # Redis integration (6 key patterns)
│ │ │ ├── client.ts # Connection management
│ │ │ ├── vectors.ts # HNSW vector search + pattern CRUD
│ │ │ ├── patterns.ts # Pattern queries
│ │ │ └── tasks.ts # Task storage + progress flushing
│ │ │
│ │ ├── browser/ # Browserbase + Stagehand
│ │ │ ├── stagehand-client.ts # 3-mode Stagehand init
│ │ │ └── session.ts # Cloud session management
│ │ │
│ │ ├── ai/ # AI integrations
│ │ │ ├── gemini.ts # Gemini pre-analysis + recovery
│ │ │ └── openai-quality.ts # GPT-4o quality scoring
│ │ │
│ │ ├── embeddings/
│ │ │ └── openai.ts # text-embedding-3-small (1536-dim)
│ │ │
│ │ ├── tracing/ # Weave (7 levels)
│ │ │ ├── weave.ts # Ops, invoke, feedback, datasets
│ │ │ └── trace-context.ts # Screenshots + DOM snapshots
│ │ │
│ │ ├── evaluation/ # Improvement measurement
│ │ │ ├── batch-eval.ts # Formal Weave eval + 4 scorers
│ │ │ └── weave-eval-logger.ts # Prediction logging
│ │ │
│ │ └── utils/
│ │ ├── types.ts # TypeScript type definitions
│ │ └── url.ts # URL pattern utilities
│ │
│ ├── components/ # React UI (13 components + ui/)
│ │ ├── task-form.tsx # Task submission
│ │ ├── task-list.tsx # Task list + filters
│ │ ├── execution-log.tsx # Step-by-step viewer
│ │ ├── trace-timeline.tsx # Visual decision trace
│ │ ├── learning-timeline.tsx # Learning event timeline
│ │ ├── learning-curve.tsx # Recharts improvement charts
│ │ ├── improvement-report.tsx # Cohort comparison + grades
│ │ ├── pattern-card.tsx # Pattern with fitness score
│ │ ├── pattern-grid.tsx # Pattern library grid
│ │ ├── stats-overview.tsx # Aggregate statistics
│ │ ├── live-session-viewer.tsx # Browserbase session embed
│ │ ├── empty-state.tsx # Empty state placeholder
│ │ └── ui/ # shadcn/ui primitives
│ │
│ └── hooks/ # React hooks (8 hooks)
│ ├── use-task-stream.ts # SSE live updates
│ ├── use-tasks.ts # Task CRUD
│ ├── use-patterns.ts # Pattern fetching
│ ├── use-evaluation.ts # Evaluation data
│ ├── use-metrics.ts # Aggregate metrics
│ ├── use-live-task.ts # Live task tracking
│ ├── use-teach.ts # Teaching mode
│ └── use-timeline.ts # Timeline data
│
├── docker-compose.yml # Redis Stack with RediSearch
├── package.json
└── tsconfig.json
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/tasks | Submit a scraping task. Body: { url, target, instruction? }. Returns task ID immediately, executes async via learning pipeline. |
| GET | /api/tasks | List all tasks with stats. Query: ?limit=50 |
| GET | /api/tasks/[id] | Full task detail: steps, screenshots, patterns, quality score |
| GET | /api/tasks/[id]/stream | SSE stream: real-time step-by-step updates, polls Redis every 500ms |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/patterns | All learned patterns with fitness scores, success/failure counts |
| GET | /api/evaluation | Cohort-based improvement metrics with weighted score and letter grade |
| POST | /api/evaluation/batch | Run formal Weave evaluation with all 4 typed scorers |
| POST | /api/teach | Manually teach a pattern. Body: { url, target, selector, approach } |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/health | Health checks: Redis connection, Browserbase API, Weave project, AI model availability |
| GET | /api/metrics | Aggregate metrics: total tasks, success rate, avg duration, cache hit rate |
| GET | /api/timeline | Task timeline for dashboard charts |
| POST | /api/demo/seed | Populate demo data for showcasing |
| POST | /api/demo/reset | Clear everything: tasks, patterns, strategy stats, confidence threshold, vector index |
MIT
WebScout — Built for WeaveHacks 3
Every failed click makes it smarter.
Weave | Redis | Browserbase | Stagehand | OpenAI | Gemini | Vercel