Watch and Learn

A Docker Compose application that lets you watch and interact with a browser controlled by an AI agent. Features hybrid control where you can both chat with the agent and directly interact with the browser.

Architecture

System Overview

                          ┌─────────────────────────────────────────────┐
                          │              User Browser (:3000)           │
                          │                                             │
                          │   ┌─ Observe tab ─┐  ┌─── Chat panel ───┐  │
                          │   │ MJPEG stream   │  │ WebSocket msgs   │  │
                          │   └────────────────┘  └──────────────────┘  │
                          │   ┌─ Control tab ──┐                        │
                          │   │ noVNC (direct) │                        │
                          │   └────────────────┘                        │
                          └──────────┬──────────────────┬───────────────┘
                                     │                  │
                       VNC (ws:6080) │                  │ WebSocket (:8000/ws)
                      MJPEG (ws:8765)│                  │
                                     ▼                  ▼
┌──────────────────────────────────────────┐  ┌──────────────────────────────────────┐
│         playwright-browser               │  │           python-agent               │
│                                          │  │                                      │
│  Xvfb (virtual display :99)             │  │  FastAPI server (:8000)              │
│  Chromium (CDP on :9222)                │  │  BrowserAgent (agentic loop)         │
│  x11vnc → websockify (:6080)           │  │  MCP Client ──── SSE (:3001) ───────►│
│  FFmpeg → video_server.py               │  │  RAG Retriever                       │
│    ├─ MJPEG stream    (:8765)           │  │                                      │
│    └─ Frame query API (:8766)           │  │         ┌────────┴────────┐          │
│  Playwright MCP server (:3001/sse)      │  │         ▼                 ▼          │
│                                          │  │  ┌───────────┐   ┌──────────────┐   │
└──────────────────────────────────────────┘  │  │ VoyageAI  │   │ MongoDB Atlas│   │
                                              │  │ (embed +  │   │ (recordings +│   │
                                              │  │  rerank)  │   │ vector search│   │
                                              │  └───────────┘   └──────────────┘   │
                                              │         │                            │
                                              │         ▼                            │
                                              │  ┌──────────────┐                    │
                                              │  │ Google Gemini│                    │
                                              │  │ (LLM)       │                    │
                                              │  └──────────────┘                    │
                                              └──────────────────────────────────────┘

Services

playwright-browser — The virtual browser environment. Runs a headless Chromium inside a virtual X11 display (Xvfb). x11vnc + websockify expose VNC over WebSocket for direct user interaction. FFmpeg captures the display and video_server.py streams it as MJPEG frames (plus a rolling buffer for historical frame queries). The Playwright MCP server exposes browser automation tools (navigate, click, type, screenshot, snapshot) over SSE.

python-agent — The AI engine. A FastAPI server that accepts chat messages over WebSocket. The BrowserAgent runs an agentic loop: it sends the user message (plus optional RAG context) to Google Gemini, which responds with structured JSON. If the response contains a tool call, the agent executes it via the MCP client, feeds the result back to Gemini, and repeats (up to 30 iterations). Supports session recording and RAG-based learning from past recordings via MongoDB Atlas + VoyageAI embeddings.

nextjs-webapp — The frontend. A Next.js 14 app with two view modes: Observe (read-only MJPEG video stream) and Control (interactive noVNC session). A chat panel connects to the python-agent over WebSocket. A Teach mode lets users record sessions with screenshots and save them with metadata for the RAG pipeline.

Data Flow (Chat Message)

1. User types message in ChatWindow
       │
       ▼  WebSocket
2. python-agent receives message
       │
       ▼
3. RAG context retrieved (if MongoDB + VoyageAI configured)
   ├─ User query → VoyageAI embedding → MongoDB cosine similarity
   ├─ Top-K results → VoyageAI reranking → Top-3
   └─ Load & downsample screenshots → multimodal context
       │
       ▼
4. [user message + RAG context + screenshot] → Gemini LLM
       │
       ▼
5. Gemini responds: { thinking, tool_call, user_message }
       │
       ├─── If tool_call present ──────────────────────┐
       │    Execute via MCP client → Playwright         │
       │    Capture screenshot (if recording)           │
       │    Feed result + screenshot back to Gemini     │
       │    └─── Loop back to step 5 (max 30 iters) ───┘
       │
       ▼
6. Collected user_message fields → WebSocket → ChatWindow

Key Design Patterns

Agentic loop — agent.py:process_message() iterates: LLM produces structured JSON, tool calls are executed silently via MCP, results fed back, until the LLM stops requesting tools or limits are hit (30 iterations, 3 retries per step).
Shared MCP client — A single persistent SSE connection to the Playwright MCP server is created at startup and shared across all WebSocket sessions, preventing stale element references.
Exponential backoff — API calls to Gemini retry with jitter on 429 rate limits (agent.py:_send_message_with_retry).
RAG pipeline — Two-stage retrieval: vector similarity (K-NN via MongoDB Atlas) then semantic reranking (VoyageAI). Retrieved screenshots are downsampled and injected as multimodal context.
Frame buffering — video_server.py maintains a rolling 500ms buffer of MJPEG frames, enabling historical frame queries (used during recording to capture pre-action screenshots).

File Guide (Suggested Reading Order)

Frontend — `services/nextjs-webapp/`

#	File	What it does
1	`src/app/layout.tsx`	Root layout (HTML shell, metadata)
2	`src/app/page.tsx`	Main page — Observe/Control mode toggle, Teach button, 2-panel layout
3	`src/components/ChatWindow.tsx`	Chat UI, WebSocket connection to agent, recording state management
4	`src/components/VideoViewer.tsx`	MJPEG stream display (Observe mode)
5	`src/hooks/useMJPEGVideoStream.ts`	Custom hook — connects to MJPEG WebSocket, decodes binary frames
6	`src/components/VNCViewer.tsx`	noVNC remote desktop embed (Control mode)
7	`src/components/RecordingMetadataModal.tsx`	Modal dialog for saving recording description + tags

Browser — `services/playwright-browser/`

#	File	What it does
8	`entrypoint.sh`	Startup orchestrator — launches Xvfb, x11vnc, websockify, video server, Chromium, Playwright MCP
9	`video_server.py`	MJPEG streaming server + rolling frame buffer + HTTP frame query API

Agent Core — `services/python-agent/`

#	File	What it does
10	`config.py`	Centralized constants — model names, RAG parameters, image sizes
11	`models.py`	Pydantic models — `AgentResponse`, `ToolCall` (structured LLM output schema)
12	`prompts.py`	System prompt builder — formats MCP tool schemas for Gemini
13	`mcp_client.py`	MCP client wrapper — persistent SSE connection, `call_tool()`, `list_tools()`
14	`main.py`	FastAPI app — WebSocket `/ws` endpoint, recording endpoints, startup init
15	`agent.py`	Core `BrowserAgent` — agentic loop, tool execution, retry logic, response parsing

RAG & Storage — `services/python-agent/`

#	File	What it does
16	`recording_models.py`	Pydantic models — `RecordingSession`, `RecordingStep`, `RetrievedRecording`
17	`voyage_service.py`	VoyageAI wrapper — `embed_document()`, `embed_query()`, `rerank()`
18	`recording_storage.py`	MongoDB connection — `save_recording()`, `search_similar()` (vector search)
19	`rag_retriever.py`	RAG pipeline — retrieval + reranking + image loading/downsampling

Reference

#	File	What it does
20	`james-mongoDB-voyageAI.md`	Detailed guide on the MongoDB + VoyageAI RAG integration

Prerequisites

Docker and Docker Compose
Google Gemini API key

Quick Start

Clone the repository
```
git clone <repo-url>
cd watch-and-learn
```

Set up environment variables

cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

Build and start the services
```
docker-compose up --build
```
Open the application
- Navigate to http://localhost:3000
- You'll see the browser view on the left and chat on the right

Services

Service	Port	Description
nextjs-webapp	3000	Main web interface
playwright-browser	6080	noVNC web interface (direct access)
python-agent	8000	WebSocket API for agent

Usage

Chat with the Agent

Type commands in the chat window to control the browser:

"Go to google.com"
"Search for weather in New York"
"Click on the first link"
"What's on the screen?"

Direct Browser Interaction

Click and type directly in the browser view panel. The AI agent can observe and respond to your actions.

Development

Running services individually

Playwright Browser:

cd services/playwright-browser
docker build -t playwright-browser .
docker run -p 6080:6080 -p 3001:3001 playwright-browser

Next.js Webapp:

cd services/nextjs-webapp
npm install
npm run dev

Python Agent:

cd services/python-agent
pip install -r requirements.txt
GEMINI_API_KEY=your_key python main.py

Environment Variables

Variable	Description	Required
GEMINI_API_KEY	Google Gemini API key	Yes

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
demo		demo
services		services
test_cases/competitor_analysis		test_cases/competitor_analysis
voyageai-chat		voyageai-chat
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
docker-compose.yml		docker-compose.yml
james-mongoDB-voyageAI.md		james-mongoDB-voyageAI.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Watch and Learn

Architecture

System Overview

Services

Data Flow (Chat Message)

Key Design Patterns

File Guide (Suggested Reading Order)

Frontend — `services/nextjs-webapp/`

Browser — `services/playwright-browser/`

Agent Core — `services/python-agent/`

RAG & Storage — `services/python-agent/`

Reference

Prerequisites

Quick Start

Services

Usage

Chat with the Agent

Direct Browser Interaction

Development

Running services individually

Environment Variables

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Watch and Learn

Architecture

System Overview

Services

Data Flow (Chat Message)

Key Design Patterns

File Guide (Suggested Reading Order)

Frontend — services/nextjs-webapp/

Browser — services/playwright-browser/

Agent Core — services/python-agent/

RAG & Storage — services/python-agent/

Reference

Prerequisites

Quick Start

Services

Usage

Chat with the Agent

Direct Browser Interaction

Development

Running services individually

Environment Variables

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Frontend — `services/nextjs-webapp/`

Browser — `services/playwright-browser/`

Agent Core — `services/python-agent/`

RAG & Storage — `services/python-agent/`

Packages