chrome-diopside

A 270M fine-tune of FunctionGemma for Chrome DevTools Protocol tool-calling, plus an AX-tree grounding layer, shipped as a Chrome side-panel extension. Type (or speak) a prompt into the side panel; a local Bun server runs inference → grounds the call against the live accessibility tree → drives the active tab via chrome.debugger. All model inference runs locally via ONNX.

Demo video: chrome-diopside-resized.mp4

Setup

bun install
git submodule update --init   # devtools-protocol + unslothai-notebooks + cerebriumai-examples
cp .env.example .env          # fill in HF_TOKEN, WANDB_API_KEY, CEREBRIUM_API_KEY, KERNEL_API_KEY

Bun auto-loads .env for all local scripts; Python scripts inherit it via child-process spawning.

Pipeline

# ── Dataset ──────────────────────────────────────────────────
bun run generate              # synthesize examples via Claude (Anthropic API)
bun run build                 # commands + tools catalog from the CDP spec
bun run push-dataset          # upload to HuggingFace Hub

# ── Training ─────────────────────────────────────────────────
bun run deploy                # push main.py + deps to Cerebrium (only when code/deps change)
bun run train --name v22      # single run
bun run sweep                 # parallel random-search sweep (see caveat below)
bun run watch v22             # poll W&B until the run terminates
bun run pipeline              # Ink TUI: navigate stages, fire steps, auto-fill W&B IDs

# ── Evaluation ───────────────────────────────────────────────
bun run eval:remote           # trigger string-match eval on Cerebrium GPU (async)
bun run eval:fetch            # fetch results + write report
bun run eval:exec             # execution-based eval against real Chrome
bun run eval:exec --grounding # …with AX-tree grounding applied to predicted calls
bun run eval:analyze          # confusion pairs + FP breakdown (diffs vs baseline if set)
bun run eval:baseline         # snapshot current report as the baseline
bun run bench:ground          # grounding benchmark on static HTML fixtures
bun run bench:ground:baseline # snapshot the grounding bench

# ── Ship ─────────────────────────────────────────────────────
bun run push <wandb_run_id>   # push winning LoRA to HuggingFace Hub
bun run convert <wandb_id>    # merge LoRA + export ONNX on Cerebrium
bun run infer "message"       # local inference, pure Bun + ONNX
bun run try --url <url> "msg" # end-to-end: infer + ground + execute on a real page
bun run observe --url <url> "query" # list AX-tree candidates matching a query (no model)
bun run release <name> <wandb_id>   # freeze release manifest (SHAs + metrics)

# ── Runtime ──────────────────────────────────────────────────
bun run serve                 # inference server only: HTTP /infer + WS /ws
bun run ext                   # extension dev build (WXT, loads as MV3 unpacked)
bun run dev                   # both in parallel

Dataset changes only need bun run push-dataset — no redeploy required. The Cerebrium image downloads the dataset from HuggingFace Hub at runtime. Code changes (main.py, cerebrium.toml) or Python dep changes need bun run deploy.

Dataset

Synthesis

scripts/generate-dataset.ts asks Claude (default claude-sonnet-4-6, temp 0.7) to produce a batch of (user_intent, tool_call) pairs for each of the 26 user-facing CDP tools. Per-tool prompts carry:

  • description — what the tool does
  • examples — 5-20 canonical natural-language phrasings (seed examples used as few-shot)
  • paramHints — what each parameter means, valid ranges, and which are required
  • constraint — tight disambiguation from neighbouring tools (e.g. Input.dispatchKeyEvent vs Page.handleJavaScriptDialog, Page.printToPDF vs Emulation.setEmulatedMedia)

Each per-tool request is stored in dataset/formatted/generation-manifest.json with its Anthropic response_id, so any batch is traceable to its source API call.
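For illustration, a per-tool entry behind those fields might look like this TypeScript sketch; the interface name, field types, and the example values are assumptions, the real spec lives in scripts/generate-dataset.ts:

// Hypothetical shape of one per-tool generation spec (names and values are illustrative only).
interface ToolGenSpec {
  description: string;                  // what the tool does
  examples: string[];                   // 5-20 seed phrasings, used as few-shot examples
  paramHints: Record<string, string>;   // what each parameter means, valid ranges, required-ness
  constraint?: string;                  // disambiguation from neighbouring tools
}

const dispatchKeyEvent: ToolGenSpec = {
  description: "Dispatches a key event to the page",
  examples: ["press Enter", "hit the escape key"],
  paramHints: { type: "keyDown | keyUp | char (required)", key: "DOM key value, e.g. Enter" },
  constraint: "Key presses only; confirming a JS dialog is Page.handleJavaScriptDialog",
};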

Tools catalog

dataset/formatted/tools.json is a flat list of {name, description, parameters} objects built from the CDP spec in vendor/devtools-protocol. 28 tools total: the 26 user-facing ones plus 2 the model needs to recognise from plumbing (Target.createTarget, Target.attachedToTarget).

Formatted output

File Description
single-step.json 1 user intent → 1 tool call — the only split currently used for training
cdp-commands.json Unique CDP commands referenced in the dataset
tools.json Tool definitions with parameters from the CDP spec
generation-manifest.json Per-tool audit trail (model, temp, response_ids)

Training data format:

[
  { "role": "user", "content": "click on \"#submit\"" },
  {
    "role": "assistant",
    "tool_calls": [
      {
        "type": "function",
        "function": {
          "name": "Input.dispatchMouseEvent",
          "arguments": { "type": "mousePressed", "x": 10, "y": 20, "button": "left" }
        }
      }
    ]
  }
]

Known limitation. The model emits one tool call per inference. Compound prompts ("open a new tab and then go to github") are handled pre-model by src/split-prompt.ts (connectives + verb-led "and", with quoted-text masking). multi-step.json is emitted as an empty placeholder (downstream code expects the path to exist); multi-turn conversations are out of scope for the current generator.

Training

Fine-tuning runs on Cerebrium (serverless GPU) with Weights & Biases for experiment tracking.

Architecture

┌──────────────────────────┐         ┌──────────────────────────┐
│ run.py (your Mac)        │         │ Cerebrium GPU replica    │
│                          │         │                          │
│  W&B sweep agent         │ HTTP    │  main.py                 │
│  ├─ pick hyperparams ────┼────────►│  └─ train_model(params)  │
│  ├─ trigger experiment   │ async   │     ├─ load FunctionGemma│
│  └─ poll W&B for results │         │     ├─ download dataset  │
│                          │ ◄───────┤     ├─ SFTTrainer w/ LoRA│
│                          │ wandb   │     ├─ save LoRA         │
└──────────────────────────┘         │     └─ report to W&B     │
                                     └──────────────────────────┘
  • main.py exposes train_model, run_eval_inference, get_eval_predictions, push_lora, convert_to_onnx, merge_and_push. Cerebrium turns each into an HTTP endpoint.
  • run.py runs locally. Defines a W&B sweep config, then for each experiment posts the chosen hyperparameters to the Cerebrium endpoint asynchronously.
  • cerebrium.toml — GPU type (A10), scaling (0–3 replicas), 1-hour response grace period for long runs.
  • pyproject.toml + uv.lock — Python deps via uv. requirements.txt is auto-generated from the lockfile for Cerebrium.

The dataset is not bundled in the Cerebrium image. It lives on HuggingFace Hub and is downloaded at runtime. Dataset changes only need bun run push-dataset, not a full bun run deploy.

Setup

  1. Accounts — sign up for Cerebrium and W&B, and accept the FunctionGemma license on HuggingFace.
  2. HuggingFace token with Write access — needed to download FunctionGemma and to upload trained adapters. Create at https://huggingface.co/settings/tokens.
  3. Local env vars — cp .env.example .env. The same keys need to be set as Cerebrium secrets (in their dashboard) so main.py can use them at training time:
    HF_TOKEN          # write-enabled
    WANDB_API_KEY
    HF_DATASET_REPO   # e.g. gesposito/chrome-diopside-dataset
    
  4. Cerebrium CLI: uv tool install cerebrium && uv tool run cerebrium login.

Python dependencies

Managed with uv; the lockfile is the source of truth.

uv add torch          # → updates pyproject.toml + uv.lock + installs
uv remove peft
uv sync               # install from lockfile
uv run python main.py

Never use uv pip — it bypasses the lockfile, and anything it installs is wiped on the next uv sync.

Deps split into two groups in pyproject.toml:

  • Base (uv sync) — torch, transformers, peft, wandb. For local eval.
  • Cerebrium (uv sync --extra cerebrium) — adds unsloth, trl, optimum. For local training.

bun run deploy auto-exports the full cerebrium deps to requirements.txt.

Baked-in tool knowledge (compressed schemas)

The 28-tool catalog in JSON Schema form expanded past the 270M model's 8k context window after the chat template wrapped each string value in <escape> tags. Earlier revisions had 63 tools and the problem was more acute.

Earlier versions worked around this by subsetting tools per example (correct tool + random distractors) with an embedding-based pre-filter at inference. That created a distribution mismatch: training saw 4 tools, inference saw 8 embedding-picked ones — costing ~20 points of accuracy.

compress_tools() in main.py strips parameter schemas, keeping only {name, description} with required-param names appended to the description. Per-tool tokens drop from ~1700 → ~100, so the full catalog fits in every prompt. The model learns parameter formats from examples rather than from the prompt — the pattern Gorilla and ToolLLaMA use for small models on fixed APIs.
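The same idea, sketched in TypeScript rather than the Python of main.py (field names are assumptions based on the description above):

// Sketch: collapse a JSON Schema tool definition to {name, description},
// appending required-parameter names so the model still sees what it must supply.
type FullTool = { name: string; description: string; parameters?: { required?: string[] } };
type CompressedTool = { name: string; description: string };

function compressTool(tool: FullTool): CompressedTool {
  const required = tool.parameters?.required ?? [];
  const suffix = required.length ? ` (required: ${required.join(", ")})` : "";
  return { name: tool.name, description: tool.description + suffix };
}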

Approach Tools per prompt Train ↔ inference match Tool match (exec)
Subsetting + embedding pre-filter (legacy) ~8 (semantic) mismatched 62.6%
Baked-in compressed schemas Full catalog matched 95.7%

Important. Examples must be text-formatted before being wrapped in a HuggingFace Dataset. Dataset.from_list() infers a unified schema and pads missing keys with None, silently reintroducing null pollution even if the source JSON is clean. The [dataset] sanity check log in main.py catches this.

Defaults that work

Baked into main.py, verified on L40 (48 GB):

Setting Value Why
batch_size 1 Larger batches OOM with long sequences
gradient_accumulation_steps 8 Effective batch size of 8
lora_r 32 Sufficient for 270M
lora_alpha 64 2× rank
max_seq_length 8192 Fits compressed tools + conversation
num_train_epochs 3 Loss plateaus around epoch 2-3
learning_rate 2e-4 Standard for LoRA on small models

Single run

bun run train --name v22
bun run train --name v22 --hub-repo gesposito/chrome-diopside
bun run train --name v22 --lora-r 64

Fires train_model async on Cerebrium. Monitor via bun run watch v22 or bun run pipeline.

Sweep

bun run sweep              # 5 experiments
bun run sweep --count 10

Runs train in parallel, up to max_replicas from cerebrium.toml. Replicas scale to zero between runs — you pay only for actual GPU seconds.

Known limitation: this is parallel random search, not Bayesian optimisation. The W&B agent creates a local run, fires an async Cerebrium call, and closes immediately — training metrics report to a separate W&B run on Cerebrium, so the optimiser never sees results. The fix is to pass the sweep run id to Cerebrium so main.py resumes it instead of creating a new run; see the run.py docstring.

Sweep runs don't push to HuggingFace Hub (hub_repo is omitted); each LoRA is saved to Cerebrium's persistent volume only. After the sweep:

  1. Compare runs in W&B
  2. Pick the winner using eval.ts accuracy metrics, not eval/loss (lower loss ≠ higher accuracy)
  3. bun run push <wandb_run_id> to promote

Push and convert

bun run push lpm34vo5                          # W&B run ID
bun run push lpm34vo5 --repo gesposito/my-model
bun run convert lpm34vo5                       # merge LoRA + export ONNX
bun run convert --path /persistent-storage/.../lora

convert runs on Cerebrium — it merges the LoRA, exports ONNX, and uploads to {hub_repo}-onnx. No local Python needed.

Evaluation

String-match eval

Every test example sees the full compressed tools catalog — the same prompt the model saw during training.

# Remote: two-step async flow
bun run eval --remote           # triggers base + fine-tuned inference (async)
bun run eval --remote kdi0mc6u  # same, for a specific W&B run
bun run eval --fetch            # reads pending.json, fetches both runs, judges, reports

# Local: from HuggingFace Hub
bun run eval --hub-repo your-username/chrome-diopside-cdp --python .venv/bin/python
bun run eval --lora-dir ./model --python .venv/bin/python

Base predictions are cached in outputs/eval/base-predictions.json:

Scenario What happens
--remote Fresh base run on Cerebrium (no local cache used)
--fetch Fetches base predictions from pending.json
Local + cache exists Reused (fast)
Local + cache missing Generated locally via Python (~10 min on Apple Silicon)
--refresh-base (local) Regenerated regardless of cache

Outputs: outputs/eval/report.{md,json}.

Execution-based eval

String match tells you if the model picks the right tool; execution eval tells you if the generated CDP commands actually run.

bun run eval:exec                     # full test set (~280 examples)
bun run eval:exec --max-examples 20   # quick subset
bun run eval:exec --predictions <p>   # skip inference, use saved predictions
bun run eval:exec --grounding         # apply AX-tree grounding to predicted calls

Spawns a headless Chrome via Bun.WebView and runs both expected and predicted commands per example. Reports tool match, execution success, and false positives (right tool, bad params). Output at outputs/eval/exec-report.{md,json}.

Some commands need browser state the model can't produce (focused inputs, open dialogs, resolved nodeIds). getSetupCommands() in eval-exec.ts navigates to a setup page per method; resolveRuntimeArgs() injects runtime values (history entryId, etc.) into the command. Destructive commands (Page.close, Browser.close) skip execution and recreate the browser.

Tools in SKIP_EXECUTION are string-matched but not actually run, so they don't inflate the FP count:

  • Page.navigateToHistoryEntry — needs a real browsing history
  • Browser.setWindowBounds — headless Chrome has no OS-level window
  • Input.dispatchTouchEvent — touchEnd/touchMove need a prior touchStart in the same session

The resolveRuntimeArgs() call is also a band-aid for the eval harness (rewriting Page.navigate URLs into data: URIs so DNS resolution doesn't affect results, defaulting Network.deleteCookies URL). The two DOM.* runtime cases (scrollIntoViewIfNeeded, setFileInputFiles) have moved into the grounding layer.

Failure analysis and baseline diffs

bun run eval:exec         # produce outputs/eval/exec-report.json
bun run eval:baseline     # freeze current state as baseline
# …iterate on dataset constraints, retrain…
bun run eval:exec
bun run eval:analyze      # diff new vs baseline; rows annotated FIXED / NEW / (=) / (±N)

Reads outputs/eval/exec-report.json; pass --baseline <path> to compare against a different file.

Grounding benchmark

A stand-alone test suite for the grounding layer (no model involved): fixtures in dataset/grounding-fixtures/*.html + cases in dataset/grounding-cases.json.

bun run bench:ground
bun run bench:ground:baseline   # snapshot the current results

15 cases across basic.html, landmarks.html, repeats.html today — all passing. Covers coord rewrite, ordinal picking ("click the 3rd reply"), landmark filtering ("click the login in the header"), and role-specific target sets. Plus 83 unit tests in src/grounding.test.ts for the matcher internals.

Grounding (AX-tree element resolution)

FunctionGemma emits Input.dispatchMouseEvent, Input.insertText, etc. with guessed coordinates and targets — it can't see the page. src/grounding.ts intercepts predicted calls, fetches the live page's accessibility tree via CDP (Accessibility.getFullAXTree), fuzzy-matches the user's utterance against interactive elements' accessible names using fuse.js, and rewrites call args accordingly. Model args stay intact as a fallback when grounding can't resolve.

Four modes, dispatched by tool:

Tool Mode What grounding does
Input.dispatchMouseEvent coord rewrite Finds target, rewrites x/y to its bounding-box centre (DOM.getBoxModel), scrolls into the layout viewport first, hit-tests for overlay occlusion
Input.dispatchDragEvent coord rewrite Same — single-target drag point
Input.dispatchTouchEvent coord rewrite Same — touch point
Input.insertText focus target Finds target (filtered to textbox/searchbox/combobox), calls DOM.focus; call passes through unchanged
Input.dispatchKeyEvent focus target Same — focuses the named target so the keypress lands there
DOM.scrollIntoViewIfNeeded nodeId inject Finds target (broader role set incl. headings, regions, landmarks), injects the real frontend nodeId
DOM.setFileInputFiles nodeId inject Injects nodeId; preserves the model's files arg
Page.navigateToHistoryEntry session-state pull Replaces fabricated entryId with the real one from Page.getNavigationHistory

All other tools pass through with grounded: false, reason: "not applicable".
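A minimal sketch of the coord-rewrite step, assuming a CdpAdapter-style cdp() function and an AX node that exposes a backendDOMNodeId; scroll-into-viewport and occlusion hit-testing are omitted:

// DOM.getBoxModel accepts a backendNodeId; model.content is a quad [x1,y1, x2,y2, x3,y3, x4,y4].
async function centreOf(
  cdp: (method: string, params?: Record<string, unknown>) => Promise<unknown>,
  backendNodeId: number,
) {
  const { model } = (await cdp("DOM.getBoxModel", { backendNodeId })) as {
    model: { content: number[] };
  };
  const [x1, y1, x2, y2, x3, y3, x4, y4] = model.content;
  return { x: (x1 + x2 + x3 + x4) / 4, y: (y1 + y2 + y3 + y4) / 4 };
}

// A grounded Input.dispatchMouseEvent then reuses the model's args with x/y rewritten:
// { ...predicted.arguments, x, y }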

Utterance parsing (powered by compromise for NLP, fuse.js for string matching):

  • Verbs and fillers stripped (click, the, type, in)
  • Role words stripped (button, link, checkbox) — they describe the AX role, not the name
  • Quoted content removed: "type 'github' in the search" → match runs against "in the search", not "github"
  • Ordinals via compromise: "click the 3rd comments link" → hits[2]; "click the last reply" → final match; compound forms like "twenty-third" supported
  • Context hints: "click the login in the header" filters candidates to those whose AX ancestor chain includes banner/navigation roles. Ambiguous contexts (search, form) fall back to no-filter matching when the landmark isn't present; unambiguous landmarks (header/footer/sidebar/main/nav) stay strict
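A minimal sketch of the name-matching core, assuming candidates have already been flattened out of the AX tree into simple records; the real matcher in src/grounding.ts layers ordinals, landmark filtering, and role-specific target sets on top of this:

import Fuse from "fuse.js";

type Candidate = { name: string; role: string; backendNodeId: number };

// Hypothetical cleanup: drop filler/role words and quoted content before matching.
function cleanUtterance(text: string): string {
  return text
    .replace(/"[^"]*"|'[^']*'/g, " ")                               // quoted content is not the target name
    .replace(/\b(click|type|the|in|on|button|link|checkbox)\b/gi, " ")
    .trim();
}

function bestMatch(utterance: string, candidates: Candidate[]) {
  const fuse = new Fuse(candidates, { keys: ["name"], includeScore: true, threshold: 0.4 });
  const hits = fuse.search(cleanUtterance(utterance));
  return hits[0]?.item;   // ordinal handling would index into `hits` instead
}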

Three entry points:

# End-to-end: model → ground → execute on a real page
bun run try --url https://news.ycombinator.com "click the comments link"

# Execution eval with grounding applied; summary adds
# a "Grounding — predictions rewritten via AX tree: N/total" line
bun run eval:exec --grounding

# Pure AX-tree query (no model) — preview what grounding would pick
bun run observe --url https://news.ycombinator.com "comments"

Not in scope.

  • Multi-step decomposition ("search for github" = focus + type + Enter). The model emits one call per inference; the grounding layer deliberately doesn't plan sequences. Compound prompts with explicit connectives are handled earlier by src/split-prompt.ts.
  • Iframe / shadow-DOM grounding. Only the top frame's AX tree is queried today, so elements inside iframes (cookie consent, Stripe checkouts, embedded widgets) are invisible. Phased plan in src/grounding.ts: same-process iframes via Page.getFrameTree + per-frame Accessibility.getFullAXTree, then OOPIF support.

Runtime: server + Chrome extension

CDP adapter abstraction

src/grounding.ts and the server code depend only on a CdpAdapter interface — cdp(method, params) → Promise<unknown>. Four transports implement it:

Backend File Use case
Local Bun.WebView src/cdp/webview.ts Server owns a headless Chrome (default for try, eval:exec)
Remote Chrome over WS src/cdp/remote.ts --remote-debugging-port=9222 (CI farms, Docker)
Proxy to client src/cdp/proxy.ts Extension bridges the active tab's chrome.debugger over the WS
kernel.sh cloud browser src/kernel.ts + src/cdp/remote.ts Server mints a kernel.sh session, pipes cdp_ws_url into remoteAdapter
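The adapter surface is small enough to sketch in full. The remote transport below is a simplified stand-in for src/cdp/remote.ts using the raw CDP wire format ({id, method, params} out, {id, result | error} back); timeouts, reconnects, and event handling are omitted:

interface CdpAdapter {
  cdp(method: string, params?: Record<string, unknown>): Promise<unknown>;
}

// Simplified remote adapter over a ws://host:9222/devtools/page/<id> socket.
function remoteAdapter(ws: WebSocket): CdpAdapter {
  let nextId = 1;
  const pending = new Map<number, { resolve: (v: unknown) => void; reject: (e: Error) => void }>();

  ws.addEventListener("message", (ev) => {
    const msg = JSON.parse(String(ev.data));
    const p = pending.get(msg.id);
    if (!p) return;
    pending.delete(msg.id);
    msg.error ? p.reject(new Error(msg.error.message)) : p.resolve(msg.result);
  });

  return {
    cdp(method, params = {}) {
      const id = nextId++;
      ws.send(JSON.stringify({ id, method, params }));
      return new Promise((resolve, reject) => pending.set(id, { resolve, reject }));
    },
  };
}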

Bun inference server

bun run serve                              # default :3765, gesposito/chrome-diopside-onnx
PORT=4000 bun run serve
CDP_MODEL=user/my-model bun run serve
bun run serve --release v20                # pin ONNX + tokenizer SHAs from releases/v20.json

Loads the model + compressed tools catalog once at startup, then exposes:

Endpoint Purpose
GET /health {ok, model, revision, startedAt, toolsLoaded}
POST /infer Stateless: {userText} → {prediction, raw, latencyMs}. No grounding, no execution.
WS /ws Stateful session (see below)
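A stateless round-trip against /infer might look like this; the field names follow the table above, the JSON content-type header is an assumption:

const res = await fetch("http://localhost:3765/infer", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ userText: "click the comments link" }),
});
const { prediction, latencyMs } = await res.json();
console.log(prediction, `${latencyMs}ms`);   // raw model output; no grounding, no execution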

WebSocket messages are JSON with a type field.

Client → server:

Message Meaning
{type: "init", target, initialUrl?, kernelOptions?} Open a session. target: "webview" (default, server-owned), "proxy" (client forwards CDP), "kernel" (server mints a kernel.sh session), or any ws://… URL for a bring-your-own remote Chrome.
{type: "user-text", text, msgId?, numTools?, maxTokens?} Infer → ground → execute against the session's adapter.
{type: "transcribe", id, audioBase64, task?} Transcribe mono 16 kHz Float32 PCM (base64-encoded). task: "translate" outputs English regardless of input language.
{type: "cdp-response", id, result? | error?} Proxy-mode only: reply to a server cdp-call.

Server → client:

Message Meaning
{type: "ready", target, sessionId?, liveViewUrl?} Session ready. liveViewUrl populated for non-headless kernel.sh sessions.
{type: "prediction", tool, arguments, latencyMs} Raw model output before grounding.
{type: "grounding", rewritten, ...} Whether grounding fired + before/after args.
{type: "execution", ok, result? | error?} CDP call result.
{type: "cdp-call", id, method, params} Proxy-mode: server asks the client to invoke a CDP method.
{type: "transcription", id, text} Whisper output.
{type: "error", where, message} Attach / transport / exec failure.

Compound-prompt splitting

src/split-prompt.ts runs before inference on the WS path. Explicit connectives ("then", ", then", ";", sentence boundaries) always split; bare "and" only splits when every resulting piece starts with an imperative verb. Quoted substrings are masked first so "type 'cats and dogs'" never splits mid-string. One false positive ("click the OK and Cancel buttons" getting split) is worse than any number of false negatives, so the heuristic is deliberately conservative.
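In terms of the splitIntoSteps step that appears in the runtime diagram further down (the name comes from that diagram; the exact signature is an assumption), the rules above play out roughly like this:

splitIntoSteps("open a new tab and then go to github");
// → ["open a new tab", "go to github"]            explicit "then" always splits

splitIntoSteps("click the OK and Cancel buttons");
// → ["click the OK and Cancel buttons"]           the conservative "and" gate avoids this false positive

splitIntoSteps("type 'cats and dogs' in the search");
// → ["type 'cats and dogs' in the search"]        quoted text is masked before any splitting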

This module becomes dead code once the fine-tune learns to emit arrays of tool calls for compound prompts.

Voice mode

The side panel includes a microphone button that records via @ricky0123/vad-web (Silero VAD) and streams the PCM buffer to the server's transcribe endpoint, which runs Whisper-base (multilingual) through @huggingface/transformers + onnxruntime-node. Transcribed text is fed straight into the normal prompt pipeline.

  • VAD assets (silero_vad_v5.onnx, vad.worklet.bundle.min.js, ort-wasm-simd-threaded.{mjs,wasm}) are copied into the extension build via vite-plugin-static-copy (flat layout, since MicVAD fetches them by fixed filename relative to baseAssetPath).
  • VAD runtime + ORT bundle (~26 MB) is code-split via dynamic import() on first mic tap — users who never record pay zero bundle cost.
  • task="translate" flag on the transcribe message switches Whisper to source → English.
  • A compact caption overlay is painted in the active tab via chrome.scripting.executeScript for feedback while speaking.
  • Provider is pluggable via TRANSCRIBE_PROVIDER env; see src/transcribe/provider.ts.
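The lazy-load in the second bullet is the usual dynamic-import pattern; in sketch form (the export name and relative path are assumptions, the module itself is extension/lib/voice-mode.ts):

// Nothing VAD/ORT-related is in the initial bundle; the ~26 MB chunk loads on first mic tap.
let voice: typeof import("../lib/voice-mode") | null = null;

async function onMicTap() {
  voice ??= await import("../lib/voice-mode");
  // hypothetical API: start capture and hand PCM chunks to the WS "transcribe" message
  await voice.startVoiceMode();
}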

Chrome extension (WXT, MV3)

extension/ is a thin bridge — side-panel UI is React + shadcn/ui + Tailwind v4; background worker is a chrome.debugger ↔ WebSocket forwarder. All inference, grounding, and execution happen server-side.

┌───────────────────────────────┐            ┌──────────────────────┐
│ Chrome extension (MV3)        │            │ Bun server :3765     │
│                               │            │                      │
│  entrypoints/sidepanel/       │ runtime.   │  WsSession           │
│  App.tsx ── user prompt ──────┼─ Port ────►│  ├─ proxyAdapter     │
│                               │            │  ├─ inference        │
│  entrypoints/background.ts    │            │  ├─ splitIntoSteps   │
│  ├─ browser.debugger.attach   │ WebSocket  │  ├─ groundToolCall   │
│  ├─ WS client ────────────────┼────────────┤  └─ adapter.cdp(…) ──┼─┐
│  └─ runs CDP calls the        │            │                      │ │
│     server proxies back ◄─────┼────────────┼──── cdp-call ────────┼─┘
└───────────────────────────────┘            └──────────────────────┘

One-time setup:

cd extension
bun install
bun run dev          # rebuild on change → extension/.output/chrome-mv3/

Then in Chrome:

  1. chrome://extensions → Developer mode ON
  2. Load unpacked → select extension/.output/chrome-mv3/
  3. Start the server in a separate terminal: bun run serve
  4. Click the extension action → side panel opens → type or speak a prompt

When a side-panel session is active, Chrome shows a "Chrome Diopside started debugging this browser" banner (any use of chrome.debugger triggers it; it can't be suppressed).

Backends. The side panel exposes a dropdown selector: "local" (default, attaches to the user's active tab via chrome.debugger) and "kernel" (server spins up a kernel.sh cloud browser; live-view URL is opened in a new tab). Kernel sessions auto-delete on WS close or after timeoutSeconds (default 300).

Files:

File Role
extension/wxt.config.ts Manifest + permissions (debugger, tabs, storage, scripting); static-copy plugin for VAD assets
extension/entrypoints/background.ts Service worker: per-port Session owns one browser.debugger attachment + one WS. On server cdp-call, invokes browser.debugger.sendCommand and replies with cdp-response. Auto-attaches on tab create; reattaches on tab switch.
extension/entrypoints/sidepanel/App.tsx React side panel: long-lived browser.runtime.Port to background, backend selector, voice toggle, streamed response log
extension/lib/voice-mode.ts VAD + transcription orchestrator (dynamically imported)
extension/lib/ws-transport.ts AI-SDK transport wrapping the WS
extension/lib/panel-connection.ts Port lifecycle + reconnection

Release pinning

"v22" isn't a single artifact — it's a bundle of dataset + base model + LoRA + ONNX + training code + eval metrics. bun run release captures the commit SHAs of each piece (via the HF API), the W&B run ID, the current git HEAD, and the eval metrics into releases/<name>.json. The file, once committed and tagged, documents the bundle.

bun run release v22 h0qiilug --notes "Browser.setWindowBounds tightened"

# From the pipeline Ship stage (runId auto-filled from the selected run;
# name auto-generated) → produces releases/r-YYYYMMDD-xxxxxx.json which
# you can rename before tagging.

git add releases/v22.json && git commit -m "Release v22"
git tag v22 && git push --tags

Captured fields: dataset, base_model, lora, onnx (each with repo + sha); wandb.run_id + URL; git_sha; eval_metrics (tool/params/combined/FP) from outputs/eval/exec-report.json at freeze time.
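Seen from TypeScript, the manifest roughly amounts to the interface below; key spelling and nesting are assumptions, the authoritative shape is whatever scripts/release.ts writes:

// Assumed shape of releases/<name>.json, derived from the captured-fields list above.
interface ReleaseManifest {
  dataset:    { repo: string; sha: string };
  base_model: { repo: string; sha: string };
  lora:       { repo: string; sha: string };
  onnx:       { repo: string; sha: string };
  wandb:      { run_id: string; url: string };
  git_sha: string;
  eval_metrics: { tool: number; params: number; combined: number; false_positives: number };
}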

Partially wired. infer.ts, try.ts, and serve all take --release <name> and pin the ONNX + tokenizer SHAs via loadModels({ revision, baseTokenizerRevision }). eval-exec.ts doesn't yet, and the snapshot_download call in main.py doesn't pass a revision= arg — so remote eval still pulls HEAD regardless of release. Until those land, a release file documents the bundle but doesn't enforce full reproducibility.

Project structure

main.py                       # Cerebrium endpoints: train, eval inference, push_lora, convert
run.py                        # W&B sweep agent (spawned by sweep.ts)
cerebrium.toml                # Cerebrium deployment config
pyproject.toml / uv.lock      # Python deps (source of truth)
requirements.txt              # Auto-generated from uv.lock for Cerebrium
src/
  browser.ts                  # Bun.WebView wrapper + executeCdp with timeout
  queue.ts                    # Sequential CDP command queue with retry policy
  inference.ts                # Shared inference module (used by infer.ts, server, try.ts)
  grounding.ts                # AX-tree element resolution (8 CDP tools, 4 modes)
  split-prompt.ts             # Compound-prompt splitter (quoted-text masking + POS gate)
  releases.ts                 # Release manifest loader
  server.ts                   # Bun inference server: HTTP /infer + WS /ws
  kernel.ts                   # kernel.sh REST client (cloud Browsers-as-a-Service)
  whisper.ts                  # Transcription façade — picks a provider at runtime
  transcribe/
    provider.ts               # TranscriptionProvider interface
    transformers-js.ts        # Whisper via @huggingface/transformers
  cdp/
    adapter.ts                # CdpAdapter interface
    webview.ts                # Local Bun.WebView adapter
    remote.ts                 # Remote Chrome over ws://host:9222/devtools/page/<id>
    proxy.ts                  # Server-side adapter forwarding CDP to a client (extension)
  grounding.test.ts           # 83 unit tests covering matcher internals
  split-prompt.test.ts        # Splitter unit tests
scripts/
  generate-dataset.ts         # Synthetic dataset generation via Claude API
  list-cdp-commands.ts        # Extract unique CDP commands used
  build-tools.ts              # Build tool definitions from the CDP protocol spec
  push-dataset.ts             # Upload dataset/formatted/ to HuggingFace Hub
  train.ts                    # Trigger a single training run on Cerebrium
  sweep.ts                    # Launch hyperparameter sweep (spawns run.py)
  watch.ts                    # Poll W&B until a run terminates
  push-lora.ts                # Push trained LoRA from Cerebrium to HF Hub
  convert.ts                  # Merge LoRA + export ONNX (on Cerebrium)
  eval.ts                     # String-match eval orchestrator (trigger, fetch, judge)
  eval-exec.ts                # Execution-based eval against real Chrome
  eval_inference.py           # Model forward pass (local eval only)
  analyze-eval.ts             # Confusion pairs + FP breakdown + baseline diff
  grounding-bench.ts          # Grounding benchmark on static HTML fixtures
  infer.ts                    # Local inference CLI (pure Bun, no Python)
  try.ts                      # End-to-end: infer + ground + execute on a URL
  observe.ts                  # AX-tree candidate listing for a query (no model)
  release.ts                  # Freeze a release manifest
  pipeline.tsx                # Ink TUI — pipeline control panel
  pipeline/
    wandb.ts                  # W&B GraphQL client + helpers
    steps.ts                  # Pipeline stage/step definitions
    components.tsx            # Shared UI components
  clean.ts                    # Remove generated artifacts
  smoke-kv-cache.ts           # KV cache smoke test
extension/                    # Chrome MV3 extension (WXT + React + Tailwind v4)
  wxt.config.ts               # Manifest + permissions + static-copy for VAD assets
  entrypoints/
    background.ts             # Service worker: chrome.debugger ↔ WS bridge
    sidepanel/                # React side panel UI
  lib/
    panel-connection.ts       # Port lifecycle + reconnection
    ws-transport.ts           # AI-SDK WS transport
    voice.ts / voice-mode.ts  # Voice capture orchestrator
dataset/
  formatted/                  # Training data (also pushed to HF Hub)
  grounding-cases.json        # Grounding benchmark cases
  grounding-fixtures/         # HTML fixtures for the grounding benchmark
releases/                     # Frozen release manifests (one JSON per bundle)
outputs/                      # gitignored
  eval/                       # eval reports + cached predictions
  grounding/                  # grounding benchmark reports
  onnx/                       # ONNX model (after conversion)
  hf-cache/                   # cached HF models for local inference
vendor/
  devtools-protocol/          # submodule — CDP protocol JSON spec
  unslothai-notebooks/        # submodule — fine-tuning notebook references
  cerebriumai-examples/       # submodule — deployment examples

Limitations

  • Single-step only. The model emits one tool call per inference. src/split-prompt.ts handles compound prompts as a pre-model pass but won't infer implicit sequencing ("search for github" = focus + type + Enter). A multi-step fine-tune would require populating dataset/formatted/multi-step.json.
  • No iframe / shadow-DOM grounding. Cookie consent banners, Stripe checkouts, embedded widgets are invisible to the grounding layer today. Plan in src/grounding.ts under "FUTURE".
  • Sweep is parallel random search, not Bayesian. See the Sweep section above.
  • Release pinning is partial. serve, infer, try honour --release; eval-exec and remote training don't.
  • 281-example test set. 95% CI around a 0.95 proportion is roughly ±2.5%. Sub-1-point differences between runs are inside the noise floor.
