RFC: browser event capture #145
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,281 @@ | ||
| # RFC: Browser Event Capture | ||
|
|
||
| ## Summary | ||
|
|
||
| Add a configurable browser event streaming system to the image server that captures CDP events (console, network, DOM, layout shifts, screenshots, interactions), tags them with tab/frame context, and durably writes them to S2 streams for near-real-time multi-consumer access. Events are also available locally via an SSE endpoint. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Browser agents need real-time observability into what the browser is doing: console output, network traffic, DOM changes, navigation, layout shifts, and user interactions. Today there is no structured event stream from the image server. Agents rely on polling screenshots or manual CDP connections. | ||
|
|
||
| This system provides: | ||
|
|
||
| 1. **Fine-grained, configurable capture** -- choose exactly which event categories to record, with per-category options (e.g., network with or without response bodies). | ||
| 2. **Tab/iframe awareness** -- every event is tagged with target ID, CDP session ID, and frame ID so consumers can distinguish events from different tabs and iframes. | ||
| 3. **Smart waiting signals** -- computed meta-events (`network_idle`, `layout_settled`, `navigation_settled`) that are strictly more informative than Playwright's `networkidle` or `domcontentloaded`, enabling smarter wait strategies. | ||
| 4. **Durable streaming via S2** -- events are written to an S2 stream for multi-consumer near-real-time access. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ```mermaid | ||
| flowchart LR | ||
| Chrome[Chromium CDP] | ||
| Monitor[CDPMonitor goroutine] | ||
| RingBuf[Ring Buffer] | ||
| S2Writer[S2 Writer goroutine] | ||
| SSE["GET /events/stream SSE"] | ||
| S2Stream[S2 Stream] | ||
| Agents[Agents / Consumers] | ||
|
|
||
| Chrome -->|"WebSocket events"| Monitor | ||
| Monitor -->|"write"| RingBuf | ||
| RingBuf --> SSE | ||
| RingBuf --> S2Writer | ||
| S2Writer --> S2Stream | ||
| SSE --> Agents | ||
| S2Stream --> Agents | ||
| ``` | ||
|
|
||
| The CDPMonitor opens its own CDP WebSocket to Chrome (using the existing `UpstreamManager.Current()` URL) and subscribes to configured CDP domains. It normalizes events into a common schema, tags each with tab/frame/target context, and writes to a local ring buffer. The ring buffer is the single write path; consumers include the SSE endpoint (`GET /events/stream`) and an S2 writer goroutine that batches and appends events to an S2 stream. This decouples S2 latency from CDP event processing. | ||
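The ring buffer described above could look roughly like the following. This is a minimal sketch, not the RFC's implementation: the names `Ring`, `Append`, and `Since` are hypothetical, `Event` is a trimmed stand-in for `BrowserEvent`, and the sketch assumes `seq` starts at 1 and is contiguous. It is not safe for concurrent use; a real buffer would need a mutex or similar.

```go
package main

import "fmt"

// Event is a trimmed stand-in for BrowserEvent; only Seq matters here.
type Event struct {
	Seq  uint64
	Type string
}

// Ring is a fixed-capacity buffer that evicts the oldest event on overflow.
// Capacity must be positive. Hypothetical type, not taken from the RFC.
type Ring struct {
	buf   []Event
	start int // index of the oldest retained event
	n     int // number of retained events
}

func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]Event, capacity)}
}

// Append stores e, overwriting the oldest event once the buffer is full.
func (r *Ring) Append(e Event) {
	if r.n < len(r.buf) {
		r.buf[(r.start+r.n)%len(r.buf)] = e
		r.n++
		return
	}
	r.buf[r.start] = e
	r.start = (r.start + 1) % len(r.buf)
}

// Since returns retained events with Seq > after, oldest first.
// gap is true when events following `after` were already evicted.
func (r *Ring) Since(after uint64) (events []Event, gap bool) {
	for i := 0; i < r.n; i++ {
		e := r.buf[(r.start+i)%len(r.buf)]
		if i == 0 && e.Seq > after+1 {
			gap = true
		}
		if e.Seq > after {
			events = append(events, e)
		}
	}
	return events, gap
}

func main() {
	r := NewRing(3)
	for seq := uint64(1); seq <= 5; seq++ {
		r.Append(Event{Seq: seq, Type: "navigation"})
	}
	events, gap := r.Since(1)
	fmt.Println(len(events), gap)
}
```

Reporting `gap` separately lets a consumer (SSE client or S2 writer) tell "caught up" apart from "fell behind and lost events", which the overflow test scenario would otherwise have to infer from missing sequence numbers.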
|
|
||
| **CDP connection isolation**: Each CDP WebSocket connection gets its own `DevToolsSession` in Chrome with independent domain handler state. Enabling `Network` on the monitor's connection does not affect the user's CDP connection: events are dispatched only to the session that enabled the domain (confirmed from Chromium source: `devtools_session.cc`, `devtools_agent_host_impl.cc`). The overhead is one additional WebSocket plus serialization of subscribed events; this should be benchmarked under load once implemented. | ||
|
|
||
| Default state is **off**. An explicit `POST /events/start` is required to begin capture. | ||
|
|
||
| ## CDP Library Choice | ||
|
|
||
| Raw `coder/websocket` (already in `go.mod`). The protocol is just JSON-RPC over WebSocket: send `{id, method, params}`, receive events `{method, params, sessionId}` and responses `{id, result/error}`. This is the same approach the existing devtools proxy uses (`server/lib/devtoolsproxy/proxy.go`). No need for chromedp's abstraction layer since we're tapping events, not driving the browser. | ||
|
|
||
| Reference protocol definitions are in `./devtools-protocol/` (cloned from [ChromeDevTools/devtools-protocol](https://github.com/ChromeDevTools/devtools-protocol)). | ||
|
|
||
| ## Event Schema | ||
|
|
||
| Each event is a JSON record, capped at **1MB** (S2's record size limit): | ||
|
|
||
| ```go | ||
| type BrowserEvent struct { | ||
|
Contributor
how should a downstream consumer ensure event ordering? two events can share the same millisecond timestamp. also, how should consumers deduplicate events? (S2 provides at-least-once delivery, so duplicates are possible.) |
||
| Seq uint64 `json:"seq"` // monotonic sequence number, resets on server startup | ||
| Timestamp int64 `json:"ts"` // unix millis | ||
| Type string `json:"type"` // snake_case event name | ||
| TargetID string `json:"target_id,omitempty"` // CDP target ID (tab/window) | ||
| CDPSessionID string `json:"cdp_session_id,omitempty"` // CDP session ID (not Kernel session) | ||
| FrameID string `json:"frame_id,omitempty"` // CDP frame ID | ||
| ParentFrameID string `json:"parent_frame_id,omitempty"` // non-empty = iframe | ||
| URL string `json:"url,omitempty"` // URL context | ||
| Data json.RawMessage `json:"data"` // event-specific payload | ||
| Truncated bool `json:"truncated,omitempty"` // true if payload was cut to fit 1MB | ||
| } | ||
| ``` | ||
|
|
||
| The `seq` field provides total ordering within a capture session. Consumers can use `(seq, type, ts)` triples for deduplication (S2 provides at-least-once delivery). The counter is a `uint64` incremented atomically and resets when the server process restarts. | ||
|
Deduplication strategy fails across server restarts (Medium Severity). The deduplication strategy using |
||
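A consumer-side filter over the `(seq, type, ts)` triple might be sketched as follows. The `Deduper` type, its bounded window, and the eviction policy are assumptions for illustration and are not specified by the RFC. When `seq` repeats after a server restart, this sketch only tells the records apart by `ts`.

```go
package main

import "fmt"

// eventKey is the (seq, type, ts) triple proposed for deduplication.
type eventKey struct {
	Seq  uint64
	Type string
	TS   int64
}

// Deduper remembers the most recent `limit` triples (hypothetical helper).
type Deduper struct {
	limit int
	seen  map[eventKey]struct{}
	order []eventKey // insertion order, used for FIFO eviction
}

func NewDeduper(limit int) *Deduper {
	return &Deduper{limit: limit, seen: make(map[eventKey]struct{})}
}

// FirstSeen reports whether the triple is new and records it.
// Triples older than the window are forgotten and would be accepted again.
func (d *Deduper) FirstSeen(seq uint64, typ string, ts int64) bool {
	k := eventKey{Seq: seq, Type: typ, TS: ts}
	if _, dup := d.seen[k]; dup {
		return false
	}
	d.seen[k] = struct{}{}
	d.order = append(d.order, k)
	if len(d.order) > d.limit {
		delete(d.seen, d.order[0])
		d.order = d.order[1:]
	}
	return true
}

func main() {
	d := NewDeduper(1024)
	fmt.Println(d.FirstSeen(1, "console_log", 1700000000000)) // first delivery
	fmt.Println(d.FirstSeen(1, "console_log", 1700000000000)) // redelivery
}
```

The window size trades memory for tolerance to late duplicates. How far apart a duplicate can arrive depends on S2 retry behavior, which this sketch does not model.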
|
|
||
| ### Event Types | ||
|
|
||
| **Raw CDP events** (forwarded from Chrome, enriched with target/frame context): | ||
|
Contributor
as designed, each event type requires a custom transform from CDP |
||
|
|
||
| | Type | CDP Source | Key Fields in `data` | | ||
| |------|-----------|---------------------| | ||
| | `console_log` | Runtime.consoleAPICalled | level, text, args, stack_trace | | ||
| | `console_error` | Runtime.exceptionThrown | text, line, column, url, stack_trace | | ||
| | `network_request` | Network.requestWillBeSent | method, url, headers, post_data, resource_type, initiator | | ||
| | `network_response` | Network.responseReceived + getResponseBody | status, status_text, url, headers, mime_type, timing, body (truncated at ~900KB) | | ||
| | `network_loading_failed` | Network.loadingFailed | url, error_text, canceled | | ||
| | `navigation` | Page.frameNavigated | url, frame_id, parent_frame_id | | ||
| | `dom_content_loaded` | Page.domContentEventFired | — | | ||
| | `page_load` | Page.loadEventFired | — | | ||
| | `dom_updated` | DOM.documentUpdated | — | | ||
| | `target_created` | Target.targetCreated | target_id, url, type | | ||
| | `target_destroyed` | Target.targetDestroyed | target_id | | ||
| | `interaction_click` | Injected JS | x, y, selector, tag, text | | ||
| | `interaction_key` | Injected JS | key, selector, tag | | ||
| | `interaction_scroll` | Injected JS | from_x, from_y, to_x, to_y, target_selector | | ||
| | `layout_shift` | Injected PerformanceObserver | score, sources (element, previous_rect, current_rect) | | ||
| | `screenshot` | ffmpeg x11grab (full display) | base64 PNG in data | | ||
cursor[bot] marked this conversation as resolved.
Contributor
Author
Valid concern -- truncating base64 PNG data produces corrupt output. We don't support 4K displays so this is unlikely in practice, but the plan now specifies: if the base64 PNG exceeds ~950KB, downscale by halving dimensions and re-encode. This keeps a usable PNG under the 1MB S2 limit. Fixed in 7b9c491. |
||
|
|
||
| **Synthetic monitor events** (emitted by the monitor itself): | ||
|
|
||
| | Type | Trigger | Key Fields in `data` | | ||
| |------|---------|---------------------| | ||
| | `monitor_disconnected` | CDP WebSocket to Chrome closed (crash, restart) | reason | | ||
| | `monitor_reconnected` | CDP WebSocket re-established after disconnect | reconnect_duration_ms | | ||
|
|
||
| These events let consumers detect gaps in the event stream rather than silently missing events during Chrome restarts. | ||
|
Missing state reset after Chrome crash reconnection (Medium Severity). The spec defines |
||
|
|
||
| **Computed meta-events** (emitted by the monitor's settling logic): | ||
|
|
||
| | Type | Trigger | | ||
| |------|---------| | ||
| | `network_idle` | Pending request count at 0 for 500ms after navigation | | ||
| | `layout_settled` | 1s of no layout-shift entries after page_load (timer resets on each shift) | | ||
|
Contributor
Author
Good catch -- the table and description were contradictory. Fixed in 7b9c491: after |
||
| | `scroll_settled` | No scroll events for 300ms with >5px movement | | ||
| | `navigation_settled` | `dom_content_loaded` AND `network_idle` AND `layout_settled` all fired | | ||
|
|
||
| ### How Computed Events Work | ||
|
|
||
| **`network_idle`**: Counter incremented on `Network.requestWillBeSent`, decremented on `Network.loadingFinished` / `Network.loadingFailed`. After `Page.frameNavigated`, when counter hits 0, start a 500ms timer. If no new requests arrive in 500ms, emit `network_idle`. Reset on next navigation. | ||
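The counting rule above can be sketched as a small state machine driven by explicit millisecond timestamps, which keeps it testable without real timers. The type and method names are hypothetical. Two details are assumptions the RFC does not settle: the pending counter is kept (not zeroed) across a navigation, and the quiet window starts immediately at navigation if nothing is pending.

```go
package main

import "fmt"

// NetworkIdle tracks pending requests and a quiet window (hypothetical type).
type NetworkIdle struct {
	quietMs   int64
	pending   int
	navigated bool  // a Page.frameNavigated has been seen
	timing    bool  // a quiet window is currently running
	zeroAtMs  int64 // when the running quiet window started
	fired     bool  // network_idle already emitted for this navigation
}

func NewNetworkIdle(quietMs int64) *NetworkIdle {
	return &NetworkIdle{quietMs: quietMs}
}

// OnNavigate handles Page.frameNavigated and re-arms the signal.
func (n *NetworkIdle) OnNavigate(nowMs int64) {
	n.navigated, n.fired = true, false
	n.timing = n.pending == 0 // assumption: start the window if already quiet
	n.zeroAtMs = nowMs
}

// OnRequestStart handles Network.requestWillBeSent.
func (n *NetworkIdle) OnRequestStart() {
	n.pending++
	n.timing = false // any new request cancels the quiet window
}

// OnRequestEnd handles Network.loadingFinished and Network.loadingFailed.
func (n *NetworkIdle) OnRequestEnd(nowMs int64) {
	if n.pending > 0 {
		n.pending--
	}
	if n.pending == 0 && n.navigated && !n.fired {
		n.timing = true
		n.zeroAtMs = nowMs
	}
}

// Poll returns true at most once per navigation, when the window has elapsed.
func (n *NetworkIdle) Poll(nowMs int64) bool {
	if !n.timing || n.fired || nowMs-n.zeroAtMs < n.quietMs {
		return false
	}
	n.fired, n.timing = true, false
	return true
}

func main() {
	n := NewNetworkIdle(500)
	n.OnRequestStart()
	n.OnNavigate(0)
	n.OnRequestEnd(100)
	fmt.Println(n.Poll(400), n.Poll(600))
}
```

In the monitor, `Poll` would be replaced or driven by a timer; passing the clock in explicitly is only a convenience for unit tests such as those planned for `settling_test.go`.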
|
|
||
| **`layout_settled`**: After `Page.loadEventFired`, inject a [`PerformanceObserver`](https://developer.mozilla.org/en-US/docs/Web/API/PerformanceObserver) watching for [`layout-shift`](https://developer.mozilla.org/en-US/docs/Web/API/LayoutShift) entries. This is a browser API that fires whenever visible elements move position without user input (e.g., an image loads and pushes text down, a font swap changes line heights, lazy content appears). Each shift entry has a `value` (0-1 score) and `sources` (which DOM nodes moved, from/to rects). Poll via `Runtime.evaluate` every 500ms. After `page_load`, start a 1s timer. Each time a layout shift is detected, reset the timer. When the timer expires (1s of quiet), emit `layout_settled`. For pages with zero layout shifts, this fires 1s after page_load. This captures visual stability that neither `networkidle` nor `domcontentloaded` can detect. | ||
|
|
||
| **`scroll_settled`**: The injected interaction tracking JS coalesces scroll events with a 300ms debounce. When scrolling stops for 300ms with >5px total movement, emit `scroll_settled`. | ||
|
|
||
| **`navigation_settled`**: Composite signal. After a navigation, track three booleans: `dom_content_loaded_fired`, `network_idle_fired`, `layout_settled_fired`. When all three are true, emit `navigation_settled`. This is strictly more informative than Playwright's `networkidle` or `domcontentloaded` because it also waits for visual stability. | ||
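The three-boolean composite described above might be tracked like this. The `NavigationSettled` type and its `Observe` method are hypothetical, and the sketch assumes one tracker per top-level frame so that iframe navigations do not reset it.

```go
package main

import "fmt"

// NavigationSettled tracks the three prerequisite signals (hypothetical type).
type NavigationSettled struct {
	domContentLoaded bool
	networkIdle      bool
	layoutSettled    bool
	emitted          bool
}

// Observe consumes an event type and returns true exactly once per
// navigation, at the moment all three prerequisite signals have been seen.
func (s *NavigationSettled) Observe(eventType string) bool {
	switch eventType {
	case "navigation":
		*s = NavigationSettled{} // a new navigation clears all state
		return false
	case "dom_content_loaded":
		s.domContentLoaded = true
	case "network_idle":
		s.networkIdle = true
	case "layout_settled":
		s.layoutSettled = true
	default:
		return false
	}
	if s.emitted || !(s.domContentLoaded && s.networkIdle && s.layoutSettled) {
		return false
	}
	s.emitted = true
	return true
}

func main() {
	var s NavigationSettled
	for _, t := range []string{"navigation", "dom_content_loaded", "network_idle", "layout_settled"} {
		fmt.Println(t, s.Observe(t))
	}
}
```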
|
|
||
| ## API Endpoints | ||
|
|
||
| Consistent with existing prefix pattern (`/recording/`, `/process/`, `/computer/`, `/fs/`, etc.): | ||
|
|
||
| ### `POST /events/start` | ||
|
|
||
| Start event capture. The request body is an `EventCaptureConfig` (see Config Schema). If capture is already running, the new config is applied on the fly. Returns 200. | ||
|
|
||
| ```json | ||
| { | ||
| "console": true, | ||
| "network": true, | ||
| "network_response_body": true, | ||
| "navigation": true, | ||
| "dom": true, | ||
| "layout_shifts": true, | ||
| "screenshots": true, | ||
| "screenshot_triggers": ["error", "navigation_settled"], | ||
| "targets": true, | ||
| "interactions": true, | ||
| "computed_events": true | ||
| } | ||
| ``` | ||
|
|
||
| All boolean fields default to `false`; `screenshot_triggers` defaults to `["error", "navigation_settled"]`. A minimal call: | ||
|
|
||
| ```json | ||
| { "network": true } | ||
| ``` | ||
|
|
||
| ### `POST /events/stop` | ||
|
|
||
| Stop event capture. Returns 200. | ||
|
|
||
| ### `GET /events/stream` | ||
|
|
||
| SSE stream of events from the local ring buffer. Returns `text/event-stream`. Each SSE event includes: | ||
|
|
||
| - `id: <seq>` -- the event's sequence number, enabling `Last-Event-ID` reconnection | ||
| - `data: <BrowserEvent JSON>` -- one `BrowserEvent` per SSE event | ||
|
|
||
| Clients can reconnect with `Last-Event-ID` to resume from where they left off (subject to ring buffer capacity). | ||
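The wire framing described above can be illustrated with two small helpers. `FormatSSE` and `ParseLastEventID` are hypothetical names. The sketch assumes the event JSON is single-line (as `json.Marshal` produces), since a raw newline inside `data:` would split the SSE field.

```go
package main

import (
	"fmt"
	"strconv"
)

// FormatSSE renders one event as an SSE frame: an id line carrying seq,
// a data line carrying the JSON, and a blank line terminating the frame.
func FormatSSE(seq uint64, eventJSON []byte) string {
	return fmt.Sprintf("id: %d\ndata: %s\n\n", seq, eventJSON)
}

// ParseLastEventID parses the Last-Event-ID request header value.
// ok is false when the header is absent or not a valid sequence number.
func ParseLastEventID(header string) (seq uint64, ok bool) {
	seq, err := strconv.ParseUint(header, 10, 64)
	if err != nil {
		return 0, false
	}
	return seq, true
}

func main() {
	fmt.Print(FormatSSE(42, []byte(`{"seq":42,"type":"page_load","data":{}}`)))
	if seq, ok := ParseLastEventID("42"); ok {
		fmt.Println("resume after seq", seq)
	}
}
```

On reconnect, the handler would replay retained events with `seq` greater than the parsed value. If those events have already been evicted from the ring buffer, the client sees a gap.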
|
|
||
| ### Config Schema | ||
|
|
||
| ```yaml | ||
| EventCaptureConfig: | ||
| type: object | ||
| properties: | ||
| console: | ||
| type: boolean | ||
| description: Capture console logs and exceptions | ||
| network: | ||
| type: boolean | ||
| description: Capture network requests and responses | ||
| network_response_body: | ||
| type: boolean | ||
| description: Include response bodies (up to ~900KB, truncated beyond). Requires network=true | ||
| navigation: | ||
| type: boolean | ||
| description: Capture page navigation and load events | ||
| dom: | ||
| type: boolean | ||
| description: Capture DOM update events | ||
| layout_shifts: | ||
| type: boolean | ||
| description: Inject PerformanceObserver for layout shift detection | ||
| screenshots: | ||
| type: boolean | ||
| description: Capture full-display screenshots at key moments | ||
| screenshot_triggers: | ||
| type: array | ||
| items: | ||
| type: string | ||
| enum: [error, page_load, navigation_settled, scroll_settled, network_idle] | ||
| description: Which events trigger a screenshot. Default [error, navigation_settled] | ||
| targets: | ||
| type: boolean | ||
| description: Capture target (tab/window) creation/destruction | ||
| interactions: | ||
| type: boolean | ||
| description: Inject JS to track clicks, keys, scrolls | ||
| computed_events: | ||
| type: boolean | ||
| description: Emit computed meta-events (network_idle, layout_settled, scroll_settled, navigation_settled) | ||
|
Computed events have undocumented config dependencies (Medium Severity). The |
||
| ``` | ||
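Validation of this schema could be sketched as below. Only the rules stated in the schema are checked: `network_response_body` requires `network`, trigger names must come from the enum, and the trigger list has a documented default. The struct is trimmed to the fields involved, and `Normalize` is a hypothetical method name. Applying the default only when `screenshots` is enabled and no triggers are given is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// EventCaptureConfig is trimmed to the fields that carry validation rules.
type EventCaptureConfig struct {
	Network             bool     `json:"network"`
	NetworkResponseBody bool     `json:"network_response_body"`
	Screenshots         bool     `json:"screenshots"`
	ScreenshotTriggers  []string `json:"screenshot_triggers"`
}

// validTriggers mirrors the enum in the schema.
var validTriggers = map[string]bool{
	"error": true, "page_load": true, "navigation_settled": true,
	"scroll_settled": true, "network_idle": true,
}

// Normalize validates the config and fills in the default triggers.
func (c *EventCaptureConfig) Normalize() error {
	if c.NetworkResponseBody && !c.Network {
		return errors.New("network_response_body requires network=true")
	}
	for _, t := range c.ScreenshotTriggers {
		if !validTriggers[t] {
			return fmt.Errorf("unknown screenshot trigger %q", t)
		}
	}
	if c.Screenshots && len(c.ScreenshotTriggers) == 0 {
		c.ScreenshotTriggers = []string{"error", "navigation_settled"}
	}
	return nil
}

func main() {
	cfg := EventCaptureConfig{NetworkResponseBody: true}
	fmt.Println(cfg.Normalize())
}
```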
|
|
||
| ## Multi-Target via setAutoAttach | ||
|
|
||
| To monitor all tabs and iframes, the monitor calls `Target.setAutoAttach` with `{autoAttach: true, waitForDebuggerOnStart: false, flatten: true}` on the browser-level CDP session. With `flatten: true`, all events from child targets arrive on the same WebSocket connection annotated with `sessionId`. The monitor maintains a `sessionId -> targetInfo` map (populated from `Target.targetCreated` / `Target.attachedToTarget` events) to enrich each event with target context (URL, type, targetId). The CDP `sessionId` is mapped to the `cdp_session_id` field in `BrowserEvent`. | ||
|
|
||
| ## Screenshots | ||
|
|
||
| Full-display screenshots using the existing ffmpeg x11grab approach (same as `TakeScreenshot` in `computer.go`). The PNG is base64-encoded and placed in the event `data` field. A typical 1920x1080 PNG screenshot is ~200-500KB base64, well under the 1MB S2 limit. If a screenshot exceeds ~950KB base64 (e.g., unusually complex screen content), downscale the image by halving dimensions and re-encode before embedding. This keeps the event under S2's 1MB record limit while preserving a usable PNG (never truncate binary data). Screenshots are triggered by configurable events (default: `error`, `navigation_settled`). | ||
|
|
||
| ## S2 Integration | ||
|
|
||
| - **New dependency**: `github.com/s2-streamstore/s2-sdk-go` (v0.11.8, same as kernel repo) | ||
| - **Config env vars** (in `server/cmd/config/config.go`): | ||
| - `S2_ACCESS_TOKEN` -- S2 access token (optional; if absent, S2 writes are skipped) | ||
| - `S2_BASIN` -- S2 basin name | ||
| - `S2_STREAM_NAME` -- stream name for browser events | ||
| - **Write path**: The S2 writer is a consumer of the ring buffer, just like SSE clients. It reads events from the ring buffer, batches them (every 100ms or 50 events, whichever comes first), and calls `streamClient.Append()` with `[]AppendRecord`. Each record body is the JSON-serialized `BrowserEvent`. This single-write-path design means the CDP monitor never blocks on S2 latency. | ||
| - **Graceful degradation**: If S2 config is not provided, the S2 writer goroutine is not started. The ring buffer and SSE endpoint still work. | ||
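The "100ms or 50 events, whichever comes first" rule in the write path can be sketched independently of the S2 SDK. The `Batcher` type below is hypothetical and deliberately stops short of the append call: whatever it returns is what the writer would hand to S2. Timestamps are passed in as milliseconds so the logic can be tested without real timers.

```go
package main

import "fmt"

// Batcher accumulates serialized events and decides when to flush.
type Batcher struct {
	maxCount  int
	maxAgeMs  int64
	pending   [][]byte
	firstAtMs int64 // arrival time of the oldest pending record
}

func NewBatcher(maxCount int, maxAgeMs int64) *Batcher {
	return &Batcher{maxCount: maxCount, maxAgeMs: maxAgeMs}
}

// Add queues a record and returns a full batch once maxCount is reached,
// or nil if the batch should keep accumulating.
func (b *Batcher) Add(record []byte, nowMs int64) [][]byte {
	if len(b.pending) == 0 {
		b.firstAtMs = nowMs
	}
	b.pending = append(b.pending, record)
	if len(b.pending) >= b.maxCount {
		return b.flush()
	}
	return nil
}

// Tick returns the pending batch once its oldest record is maxAgeMs old.
func (b *Batcher) Tick(nowMs int64) [][]byte {
	if len(b.pending) > 0 && nowMs-b.firstAtMs >= b.maxAgeMs {
		return b.flush()
	}
	return nil
}

func (b *Batcher) flush() [][]byte {
	out := b.pending
	b.pending = nil
	return out
}

func main() {
	b := NewBatcher(50, 100)
	b.Add([]byte(`{"seq":1}`), 0)
	fmt.Println(len(b.Tick(50)), len(b.Tick(100)))
}
```

Measuring age from the oldest pending record, rather than from the last flush, bounds how long any single event waits before being appended. The RFC does not say which of the two it intends.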
|
|
||
| ## Files to Create / Modify | ||
|
|
||
| ### New Files | ||
|
|
||
| | File | Purpose | | ||
| |------|---------| | ||
| | `server/lib/cdpmonitor/monitor.go` | Core: raw coder/websocket CDP client, domain enablement, setAutoAttach, event dispatch loop | | ||
| | `server/lib/cdpmonitor/events.go` | BrowserEvent struct, event type constants, JSON serialization, 1MB truncation | | ||
| | `server/lib/cdpmonitor/config.go` | EventCaptureConfig struct, validation, reconfiguration | | ||
| | `server/lib/cdpmonitor/settling.go` | Network idle state machine, layout shift observer injection/polling, composite navigation_settled | | ||
| | `server/lib/cdpmonitor/interactions.go` | JS injection for click/key/scroll tracking, 500ms polling, scroll 300ms debounce | | ||
| | `server/lib/cdpmonitor/screenshot.go` | Full-display screenshot via ffmpeg x11grab, base64 encode, triggered by event hooks | | ||
| | `server/lib/cdpmonitor/s2writer.go` | Batched S2 append writer, graceful degradation | | ||
| | `server/lib/cdpmonitor/buffer.go` | Ring buffer for local SSE subscribers | | ||
| | `server/cmd/api/api/events.go` | HTTP handlers for /events/start, /events/stop, /events/stream | | ||
|
|
||
| ### Modified Files | ||
|
|
||
| | File | Changes | | ||
| |------|---------| | ||
| | `server/openapi.yaml` | Add POST /events/start, POST /events/stop, GET /events/stream endpoints | | ||
| | `server/cmd/api/api/api.go` | Add CDPMonitor field to ApiService | | ||
| | `server/cmd/api/main.go` | Wire up CDPMonitor with optional S2 client | | ||
| | `server/cmd/config/config.go` | Add S2_ACCESS_TOKEN, S2_BASIN, S2_STREAM_NAME env vars | | ||
| | `server/go.mod` | Add s2-sdk-go dependency | | ||
|
|
||
| ## Testing Plan | ||
|
|
||
| ### Unit Tests (`server/lib/cdpmonitor/*_test.go`) | ||
|
Contributor
the test plan covers happy paths well but doesn't mention failure modes: Chrome crash/restart during capture, ring buffer overflow under high event volume, or calling |
||
|
|
||
| | File | Coverage | | ||
| |------|----------| | ||
| | `events_test.go` | Event serialization, 1MB truncation (verify truncated flag set, payload under limit), snake_case type validation | | ||
| | `config_test.go` | Config validation, defaults, reconfiguration merging, network_response_body requires network | | ||
| | `settling_test.go` | Network idle state machine (request counting, 500ms timer, reset on navigation), layout settled 1s timer, composite navigation_settled requires all 3 signals | | ||
| | `buffer_test.go` | Ring buffer overflow, subscriber catch-up, concurrent read/write safety | | ||
| | `s2writer_test.go` | Time-based and count-based flush batching, graceful skip when S2 not configured | | ||
|
|
||
| ### Integration Tests (`server/e2e/`) | ||
|
|
||
| Tests are grouped to minimize container overhead. Each test function runs in a shared container. | ||
|
|
||
| | File | Scenarios Covered | | ||
| |------|-------------------| | ||
| | `e2e_events_core_test.go` | **Lifecycle**: start/stop/restart capture. **Reconfigure**: start with network-only, verify no console events, reconfigure to add console, verify console events appear. **Console**: navigate to page with console.log/console.error, verify `console_log` and `console_error` events. **Network**: navigate to page that fetches an API, verify `network_request` + `network_response`, test with response bodies enabled, test large response truncation. | | ||
| | `e2e_events_navigation_test.go` | **Navigation & settling**: navigate between pages, verify `navigation`, `dom_content_loaded`, `page_load` events. Verify `network_idle`, `layout_settled`, `navigation_settled` fire in correct order. **Iframes**: load page with iframe, verify events carry correct `frame_id` and `parent_frame_id`. **Screenshots**: configure screenshot on `navigation_settled`, verify `screenshot` event with base64 PNG data. | | ||
| | `e2e_events_targets_test.go` | **Multi-target (setAutoAttach)**: open new tab via `window.open()`, verify `target_created` with correct URL and distinct `cdp_session_id`. Navigate in second tab, verify events attributed correctly. Close tab, verify `target_destroyed`. **Interactions**: click element, type in input, scroll page; verify `interaction_click`, `interaction_key`, `interaction_scroll`, `scroll_settled` events. | | ||
| | `e2e_events_failure_test.go` | **Chrome crash/restart**: kill Chrome process during active capture, verify `monitor_disconnected` event with reason, verify automatic reconnection and `monitor_reconnected` event, verify domain re-subscription and events resume. **Ring buffer overflow**: generate high event volume (e.g., tight network request loop), verify oldest events are evicted without crash, verify SSE clients receive latest events. **Start before Chrome ready**: call `/events/start` before Chrome has finished launching, verify graceful error response (503) or queued start that activates once Chrome is available. | | ||
|
|
||
| ## Appendix: Prior Art | ||
|
|
||
| - [dev3000 CDPMonitor](./dev3000/src/cdp-monitor.ts) -- TypeScript implementation of CDP event capture using raw `ws` WebSocket. Covers console, network, navigation, DOM, interactions (injected JS), and screenshot triggers. Connects to a single page target. | ||
| - [dev3000 ScreencastManager](./dev3000/src/screencast-manager.ts) -- Passive screencast capture and CLS detection using injected PerformanceObserver. Captures layout shift sources with element/rect details. | ||
| - [kernel API S2 usage](https://github.com/onkernel/kernel/tree/main/packages/api/lib/s2util) -- Go patterns for S2 read/write sessions using `s2-sdk-go`. | ||


when Chrome crashes and restarts mid-capture, the monitor's WebSocket dies and events are lost until reconnect. consider emitting synthetic `monitor_disconnected`/`monitor_reconnected` events so consumers know there's a gap in the stream rather than silently missing events.