A 270M fine-tune of FunctionGemma for Chrome DevTools Protocol tool-calling, plus an AX-tree grounding layer, shipped as a Chrome side-panel extension. Type (or speak) a prompt into the side panel; a local Bun server runs inference → grounds the call against the live accessibility tree → drives the active tab via chrome.debugger. All model inference runs locally via ONNX.
chrome-diopside-resized.mp4
```sh
bun install
git submodule update --init   # devtools-protocol + unslothai-notebooks + cerebriumai-examples
cp .env.example .env          # fill in HF_TOKEN, WANDB_API_KEY, CEREBRIUM_API_KEY, KERNEL_API_KEY
```

Bun auto-loads `.env` for all local scripts; Python scripts inherit it via child-process spawning.
```sh
# ── Dataset ──────────────────────────────────────────────────
bun run generate                # synthesize examples via Claude (Anthropic API)
bun run build                   # commands + tools catalog from the CDP spec
bun run push-dataset            # upload to HuggingFace Hub

# ── Training ─────────────────────────────────────────────────
bun run deploy                  # push main.py + deps to Cerebrium (only when code/deps change)
bun run train --name v22        # single run
bun run sweep                   # parallel random-search sweep (see caveat below)
bun run watch v22               # poll W&B until the run terminates
bun run pipeline                # Ink TUI: navigate stages, fire steps, auto-fill W&B IDs

# ── Evaluation ───────────────────────────────────────────────
bun run eval:remote             # trigger string-match eval on Cerebrium GPU (async)
bun run eval:fetch              # fetch results + write report
bun run eval:exec               # execution-based eval against real Chrome
bun run eval:exec --grounding   # …with AX-tree grounding applied to predicted calls
bun run eval:analyze            # confusion pairs + FP breakdown (diffs vs baseline if set)
bun run eval:baseline           # snapshot current report as the baseline
bun run bench:ground            # grounding benchmark on static HTML fixtures
bun run bench:ground:baseline   # snapshot the grounding bench

# ── Ship ─────────────────────────────────────────────────────
bun run push <wandb_run_id>     # push winning LoRA to HuggingFace Hub
bun run convert <wandb_id>      # merge LoRA + export ONNX on Cerebrium
bun run infer "message"         # local inference, pure Bun + ONNX
bun run try --url <url> "msg"   # end-to-end: infer + ground + execute on a real page
bun run observe --url <url> "query"   # list AX-tree candidates matching a query (no model)
bun run release <name> <wandb_id>     # freeze release manifest (SHAs + metrics)

# ── Runtime ──────────────────────────────────────────────────
bun run serve                   # inference server only: HTTP /infer + WS /ws
bun run ext                     # extension dev build (WXT, loads as MV3 unpacked)
bun run dev                     # both in parallel
```

Dataset changes only need `bun run push-dataset` — no redeploy required. The Cerebrium image downloads the dataset from HuggingFace Hub at runtime. Code changes (`main.py`, `cerebrium.toml`) or Python dep changes need `bun run deploy`.
scripts/generate-dataset.ts asks Claude (default claude-sonnet-4-6, temp 0.7) to produce a batch of (user_intent, tool_call) pairs for each of the 26 user-facing CDP tools. Per-tool prompts carry:
- `description` — what the tool does
- `examples` — 5-20 canonical natural-language phrasings (seed examples used as few-shot)
- `paramHints` — what each parameter means, valid ranges, and which are required
- `constraint` — tight disambiguation from neighbouring tools (e.g. `Input.dispatchKeyEvent` vs `Page.handleJavaScriptDialog`, `Page.printToPDF` vs `Emulation.setEmulatedMedia`)
Each per-tool request is stored in dataset/formatted/generation-manifest.json with its Anthropic response_id, so any batch is traceable to its source API call.
`dataset/formatted/tools.json` is a flat list of `{name, description, parameters}` objects built from the CDP spec in `vendor/devtools-protocol`. 28 tools total: the 26 user-facing ones plus 2 the model needs to recognise from protocol plumbing (`Target.createTarget`, `Target.attachedToTarget`).
| File | Description |
|---|---|
| `single-step.json` | 1 user intent → 1 tool call — the only split currently used for training |
| `cdp-commands.json` | Unique CDP commands referenced in the dataset |
| `tools.json` | Tool definitions with parameters from the CDP spec |
| `generation-manifest.json` | Per-tool audit trail (model, temp, response_ids) |
Training data format:
```json
[
  { "role": "user", "content": "click on \"#submit\"" },
  {
    "role": "assistant",
    "tool_calls": [
      {
        "type": "function",
        "function": {
          "name": "Input.dispatchMouseEvent",
          "arguments": { "type": "mousePressed", "x": 10, "y": 20, "button": "left" }
        }
      }
    ]
  }
]
```

Known limitation. The model emits one tool call per inference. Compound prompts ("open a new tab and then go to github") are handled pre-model by `src/split-prompt.ts` (connectives + verb-led *and*, with quoted-text masking). `multi-step.json` is emitted as an empty placeholder (downstream code expects the path to exist); multi-turn conversations are out of scope for the current generator.
Fine-tuning runs on Cerebrium (serverless GPU) with Weights & Biases for experiment tracking.
```
┌──────────────────────────┐           ┌──────────────────────────┐
│ run.py (your Mac)        │           │ Cerebrium GPU replica    │
│                          │           │                          │
│ W&B sweep agent          │   HTTP    │ main.py                  │
│ ├─ pick hyperparams ─────┼──────────►│ └─ train_model(params)   │
│ ├─ trigger experiment    │   async   │    ├─ load FunctionGemma │
│ └─ poll W&B for results  │           │    ├─ download dataset   │
│                          │◄──────────┤    ├─ SFTTrainer w/ LoRA │
│                          │   wandb   │    ├─ save LoRA          │
└──────────────────────────┘           │    └─ report to W&B      │
                                       └──────────────────────────┘
```
- `main.py` exposes `train_model`, `run_eval_inference`, `get_eval_predictions`, `push_lora`, `convert_to_onnx`, `merge_and_push`. Cerebrium turns each into an HTTP endpoint.
- `run.py` runs locally. Defines a W&B sweep config, then for each experiment posts the chosen hyperparameters to the Cerebrium endpoint asynchronously.
- `cerebrium.toml` — GPU type (A10), scaling (0–3 replicas), 1-hour response grace period for long runs.
- `pyproject.toml` + `uv.lock` — Python deps via uv. `requirements.txt` is auto-generated from the lockfile for Cerebrium.
The dataset is not bundled in the Cerebrium image. It lives on HuggingFace Hub and is downloaded at runtime. Dataset changes only need bun run push-dataset, not a full bun run deploy.
- Accounts — sign up for Cerebrium, W&B, accept the FunctionGemma license on HuggingFace.
- HuggingFace token with Write access — needed to download FunctionGemma and to upload trained adapters. Create at https://huggingface.co/settings/tokens.
- Local env vars — `cp .env.example .env`. The same keys need to be set as Cerebrium secrets (their dashboard) so `main.py` can use them at training time:

  ```
  HF_TOKEN          # write-enabled
  WANDB_API_KEY
  HF_DATASET_REPO   # e.g. gesposito/chrome-diopside-dataset
  ```

- Cerebrium CLI: `uv tool install cerebrium && uv tool run cerebrium login`.
Managed with uv; the lockfile is source of truth.
```sh
uv add torch       # → updates pyproject.toml + uv.lock + installs
uv remove peft
uv sync            # install from lockfile
uv run python main.py
```

Never use `uv pip` — it bypasses the lockfile and gets wiped on the next `uv sync`.
Deps split into two groups in pyproject.toml:
- Base (`uv sync`) — torch, transformers, peft, wandb. For local eval.
- Cerebrium (`uv sync --extra cerebrium`) — adds unsloth, trl, optimum. For local training.
bun run deploy auto-exports the full cerebrium deps to requirements.txt.
The 28-tool catalog in JSON Schema form expanded past the 270M model's 8k context window after the chat template wrapped each string value in <escape> tags. Earlier revisions had 63 tools and the problem was more acute.
Earlier versions worked around this by subsetting tools per example (correct tool + random distractors) with an embedding-based pre-filter at inference. That created a distribution mismatch: training saw 4 tools, inference saw 8 embedding-picked ones — costing ~20 points of accuracy.
compress_tools() in main.py strips parameter schemas, keeping only {name, description} with required-param names appended to the description. Per-tool tokens drop from ~1700 → ~100, so the full catalog fits in every prompt. The model learns parameter formats from examples rather than from the prompt — the pattern Gorilla and ToolLLaMA use for small models on fixed APIs.
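The compression step can be sketched as follows — a TypeScript illustration of the idea only (the real `compress_tools()` lives in `main.py` in Python; types and names here are hypothetical):

```typescript
// Sketch: drop full JSON Schema parameter blocks, keep {name, description},
// and append required-parameter names so the model still sees what to fill in.
type ToolDef = {
  name: string;
  description: string;
  parameters?: { properties?: Record<string, unknown>; required?: string[] };
};

function compressTools(tools: ToolDef[]): { name: string; description: string }[] {
  return tools.map((t) => {
    const required = t.parameters?.required ?? [];
    const suffix = required.length ? ` Required: ${required.join(", ")}.` : "";
    return { name: t.name, description: t.description + suffix };
  });
}

const compressed = compressTools([
  {
    name: "Input.dispatchMouseEvent",
    description: "Dispatches a mouse event to the page.",
    parameters: { properties: { type: {}, x: {}, y: {} }, required: ["type", "x", "y"] },
  },
]);
console.log(compressed[0].description);
// → "Dispatches a mouse event to the page. Required: type, x, y."
```

Dropping the schema bodies is what takes each tool from ~1700 tokens to ~100, so all 28 fit in every prompt.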
| Approach | Tools per prompt | Train ↔ inference match | Tool match (exec) |
|---|---|---|---|
| Subsetting + embedding pre-filter (legacy) | ~8 (semantic) | mismatched | 62.6% |
| Baked-in compressed schemas | Full catalog | matched | 95.7% |
Important. Examples must be text-formatted before being wrapped in a HuggingFace Dataset. Dataset.from_list() infers a unified schema and pads missing keys with None, silently reintroducing null pollution even if the source JSON is clean. The [dataset] sanity check log in main.py catches this.
Baked into `main.py`, verified on an NVIDIA L40 (48 GB):
| Setting | Value | Why |
|---|---|---|
| `batch_size` | 1 | Larger batches OOM with long sequences |
| `gradient_accumulation_steps` | 8 | Effective batch size of 8 |
| `lora_r` | 32 | Sufficient for 270M |
| `lora_alpha` | 64 | 2× rank |
| `max_seq_length` | 8192 | Fits compressed tools + conversation |
| `num_train_epochs` | 3 | Loss plateaus around epoch 2-3 |
| `learning_rate` | 2e-4 | Standard for LoRA on small models |
```sh
bun run train --name v22
bun run train --name v22 --hub-repo gesposito/chrome-diopside
bun run train --name v22 --lora-r 64
```

Fires `train_model` async on Cerebrium. Monitor via `bun run watch v22` or `bun run pipeline`.
```sh
bun run sweep             # 5 experiments
bun run sweep --count 10
```

Each run trains in parallel up to `max_replicas` in `cerebrium.toml`. Replicas scale to zero between runs — pay only for actual GPU seconds.
Known limitation: this is parallel random search, not Bayesian optimisation. The W&B agent creates a local run, fires an async Cerebrium call, and closes immediately — training metrics report to a separate W&B run on Cerebrium, so the optimiser never sees results. The fix is to pass the sweep run id to Cerebrium so main.py resumes it instead of creating a new run; see the run.py docstring.
Sweep runs don't push to HuggingFace Hub (hub_repo is omitted). Each LoRA saves to Cerebrium's persistent volume only. After the sweep:
- Compare runs in W&B
- Pick the winner using `eval.ts` accuracy metrics, not `eval/loss` (lower loss ≠ higher accuracy)
- `bun run push <wandb_run_id>` to promote
```sh
bun run push lpm34vo5                  # W&B run ID
bun run push lpm34vo5 --repo gesposito/my-model
bun run convert lpm34vo5               # merge LoRA + export ONNX
bun run convert --path /persistent-storage/.../lora
```

`convert` runs on Cerebrium — it merges the LoRA, exports ONNX, and uploads to `{hub_repo}-onnx`. No local Python needed.
Every test example sees the full compressed tools catalog — the same prompt the model saw during training.
```sh
# Remote: two-step async flow
bun run eval --remote              # triggers base + fine-tuned inference (async)
bun run eval --remote kdi0mc6u     # same, for a specific W&B run
bun run eval --fetch               # reads pending.json, fetches both runs, judges, reports

# Local: from HuggingFace Hub
bun run eval --hub-repo your-username/chrome-diopside-cdp --python .venv/bin/python
bun run eval --lora-dir ./model --python .venv/bin/python
```

Base predictions cache in `outputs/eval/base-predictions.json`:
| Scenario | What happens |
|---|---|
| `--remote` | Fresh base run on Cerebrium (no local cache used) |
| `--fetch` | Fetches base predictions from pending.json |
| Local + cache exists | Reused (fast) |
| Local + cache missing | Generated locally via Python (~10 min on Apple Silicon) |
| `--refresh-base` (local) | Regenerated regardless of cache |
Outputs: outputs/eval/report.{md,json}.
String match tells you if the model picks the right tool; execution eval tells you if the generated CDP commands actually run.
```sh
bun run eval:exec                     # full test set (~280 examples)
bun run eval:exec --max-examples 20   # quick subset
bun run eval:exec --predictions <p>   # skip inference, use saved predictions
bun run eval:exec --grounding         # apply AX-tree grounding to predicted calls
```

Spawns a headless Chrome via `Bun.WebView` and runs both expected and predicted commands per example. Reports tool match, execution success, and false positives (right tool, bad params). Output at `outputs/eval/exec-report.{md,json}`.
Some commands need browser state the model can't produce (focused inputs, open dialogs, resolved nodeIds). getSetupCommands() in eval-exec.ts navigates to a setup page per method; resolveRuntimeArgs() injects runtime values (history entryId, etc.) into the command. Destructive commands (Page.close, Browser.close) skip execution and recreate the browser.
Tools in SKIP_EXECUTION — string-matched but not actually run — so they don't inflate FP:
- `Page.navigateToHistoryEntry` — needs a real browsing history
- `Browser.setWindowBounds` — headless Chrome has no OS-level window
- `Input.dispatchTouchEvent` — `touchEnd`/`touchMove` need a prior `touchStart` in the same session
The resolveRuntimeArgs() call is also a band-aid for the eval harness (rewriting Page.navigate URLs into data: URIs so DNS resolution doesn't affect results, defaulting Network.deleteCookies URL). The two DOM.* runtime cases (scrollIntoViewIfNeeded, setFileInputFiles) have moved into the grounding layer.
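The `data:` URI rewrite can be sketched like this — a simplified stand-in for the harness logic (command shape and function name are assumptions; the real code is in `eval-exec.ts`):

```typescript
// Sketch: external Page.navigate URLs become self-contained data: URIs,
// so DNS availability can never affect the eval verdict.
type CdpCommand = { method: string; params: Record<string, unknown> };

function rewriteForEval(cmd: CdpCommand): CdpCommand {
  if (cmd.method === "Page.navigate" && typeof cmd.params.url === "string") {
    const url = cmd.params.url as string;
    if (url.startsWith("http://") || url.startsWith("https://")) {
      // Keep the original URL visible in the stub page for debugging.
      const html = `<title>stub for ${url}</title>`;
      return {
        ...cmd,
        params: { ...cmd.params, url: `data:text/html,${encodeURIComponent(html)}` },
      };
    }
  }
  return cmd; // everything else passes through untouched
}

const out = rewriteForEval({ method: "Page.navigate", params: { url: "https://example.com" } });
console.log(out.params.url); // a data:text/html,… URI
```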
```sh
bun run eval:exec          # produce outputs/eval/exec-report.json
bun run eval:baseline      # freeze current state as baseline
# …iterate on dataset constraints, retrain…
bun run eval:exec
bun run eval:analyze       # diff new vs baseline; rows annotated FIXED / NEW / (=) / (±N)
```

Reads `outputs/eval/exec-report.json`; pass `--baseline <path>` to compare against a different file.
Stand-alone test suite for the grounding layer in isolation (no model): fixtures in dataset/grounding-fixtures/*.html + cases in dataset/grounding-cases.json.
```sh
bun run bench:ground
bun run bench:ground:baseline   # snapshot the current results
```

15 cases across basic.html, landmarks.html, repeats.html today — all passing. Covers coord rewrite, ordinal picking ("click the 3rd reply"), landmark filtering ("click the login in the header"), and role-specific target sets. Plus 83 unit tests in `src/grounding.test.ts` for the matcher internals.
FunctionGemma emits Input.dispatchMouseEvent, Input.insertText, etc. with guessed coordinates and targets — it can't see the page. src/grounding.ts intercepts predicted calls, fetches the live page's accessibility tree via CDP (Accessibility.getFullAXTree), fuzzy-matches the user's utterance against interactive elements' accessible names using fuse.js, and rewrites call args accordingly. Model args stay intact as a fallback when grounding can't resolve.
Four modes, dispatched by tool:
| Tool | Mode | What grounding does |
|---|---|---|
| `Input.dispatchMouseEvent` | coord rewrite | Finds target, rewrites x/y to its bounding-box centre (`DOM.getBoxModel`), scrolls into the layout viewport first, hit-tests for overlay occlusion |
| `Input.dispatchDragEvent` | coord rewrite | Same — single-target drag point |
| `Input.dispatchTouchEvent` | coord rewrite | Same — touch point |
| `Input.insertText` | focus target | Finds target (filtered to textbox/searchbox/combobox), calls `DOM.focus`; call passes through unchanged |
| `Input.dispatchKeyEvent` | focus target | Same — focuses the named target so the keypress lands there |
| `DOM.scrollIntoViewIfNeeded` | nodeId inject | Finds target (broader role set incl. headings, regions, landmarks), injects the real frontend nodeId |
| `DOM.setFileInputFiles` | nodeId inject | Injects nodeId; preserves the model's `files` arg |
| `Page.navigateToHistoryEntry` | session-state pull | Replaces fabricated entryId with the real one from `Page.getNavigationHistory` |
All other tools pass through with grounded: false, reason: "not applicable".
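The tool → mode dispatch is essentially a table lookup; a minimal sketch (mode names condensed from the table above; the real per-mode CDP work lives in `src/grounding.ts`):

```typescript
// Sketch of the grounding dispatch: each tool maps to one of four modes,
// everything else passes through untouched.
type GroundingMode =
  | "coord-rewrite"    // rewrite x/y to the matched element's centre
  | "focus-target"     // DOM.focus the matched element, pass call through
  | "nodeid-inject"    // replace a fabricated nodeId with the real one
  | "session-state"    // pull the real value from live session state
  | "passthrough";

const MODE_BY_TOOL: Record<string, GroundingMode> = {
  "Input.dispatchMouseEvent": "coord-rewrite",
  "Input.dispatchDragEvent": "coord-rewrite",
  "Input.dispatchTouchEvent": "coord-rewrite",
  "Input.insertText": "focus-target",
  "Input.dispatchKeyEvent": "focus-target",
  "DOM.scrollIntoViewIfNeeded": "nodeid-inject",
  "DOM.setFileInputFiles": "nodeid-inject",
  "Page.navigateToHistoryEntry": "session-state",
};

function groundingMode(tool: string): GroundingMode {
  return MODE_BY_TOOL[tool] ?? "passthrough";
}

console.log(groundingMode("Input.insertText")); // → "focus-target"
console.log(groundingMode("Page.printToPDF"));  // → "passthrough"
```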
Utterance parsing (powered by compromise for NLP, fuse.js for string matching):
- Verbs and fillers stripped (`click`, `the`, `type`, `in`)
- Role words stripped (`button`, `link`, `checkbox`) — they describe the AX role, not the name
- Quoted content removed: `"type 'github' in the search"` → match runs against `"in the search"`, not `"github"`
- Ordinals via compromise: `"click the 3rd comments link"` → `hits[2]`; `"click the last reply"` → final match; compound forms like `"twenty-third"` supported
- Context hints: `"click the login in the header"` filters candidates to those whose AX ancestor chain includes `banner`/`navigation` roles. Ambiguous contexts (`search`, `form`) fall back to no-filter matching when the landmark isn't present; unambiguous landmarks (header/footer/sidebar/main/nav) stay strict
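A simplified sketch of the cleanup steps above (the real code uses compromise for POS/ordinal parsing and fuse.js for fuzzy matching; this hand-rolled version only handles digit ordinals like "3rd", and the word lists are abbreviated):

```typescript
// Sketch: strip quoted payload, pull an ordinal, drop verbs/fillers/role words.
const FILLERS = new Set(["click", "type", "the", "in", "on", "a", "an"]);
const ROLE_WORDS = new Set(["button", "link", "checkbox", "textbox"]);

function parseUtterance(text: string): { query: string; ordinal: number | null } {
  // 1. Remove quoted content — it's payload to type, not a target name.
  const unquoted = text.replace(/"[^"]*"|'[^']*'/g, " ");
  // 2. Digit ordinal: "3rd" → index 2.
  const ord = unquoted.match(/\b(\d+)(?:st|nd|rd|th)\b/);
  const ordinal = ord ? parseInt(ord[1], 10) - 1 : null;
  // 3. What survives is matched against accessible names.
  const query = unquoted
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w && !FILLERS.has(w) && !ROLE_WORDS.has(w) && !/^\d+(st|nd|rd|th)$/.test(w))
    .join(" ");
  return { query, ordinal };
}

console.log(parseUtterance("click the 3rd comments link"));
// → query "comments", ordinal 2
```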
Three entry points:
```sh
# End-to-end: model → ground → execute on a real page
bun run try --url https://news.ycombinator.com "click the comments link"

# Execution eval with grounding applied; summary adds
# a "Grounding — predictions rewritten via AX tree: N/total" line
bun run eval:exec --grounding

# Pure AX-tree query (no model) — preview what grounding would pick
bun run observe --url https://news.ycombinator.com "comments"
```

Not in scope.
- Multi-step decomposition ("search for github" = focus + type + Enter). The model emits one call per inference; the grounding layer deliberately doesn't plan sequences. Compound prompts with explicit connectives are handled earlier by `src/split-prompt.ts`.
- Iframe / shadow-DOM grounding. Only the top frame's AX tree is queried today, so elements inside iframes (cookie consent, Stripe checkouts, embedded widgets) are invisible. Phased plan in `src/grounding.ts`: same-process iframes via `Page.getFrameTree` + per-frame `Accessibility.getFullAXTree`, then OOPIF support.
src/grounding.ts and the server code depend only on a CdpAdapter interface — cdp(method, params) → Promise<unknown>. Four transports implement it:
| Backend | File | Use case |
|---|---|---|
| Local `Bun.WebView` | `src/cdp/webview.ts` | Server owns a headless Chrome (default for `try`, `eval:exec`) |
| Remote Chrome over WS | `src/cdp/remote.ts` | `--remote-debugging-port=9222` (CI farms, Docker) |
| Proxy to client | `src/cdp/proxy.ts` | Extension bridges the active tab's `chrome.debugger` over the WS |
| kernel.sh cloud browser | `src/kernel.ts` + `src/cdp/remote.ts` | Server mints a kernel.sh session, pipes `cdp_ws_url` into `remoteAdapter` |
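The interface itself is tiny, which is what makes four transports cheap. A sketch, including a stub adapter of the kind that's handy for unit-testing grounding logic without a browser (the stub helper is hypothetical, not from the repo):

```typescript
// The single seam everything grounds through (shape from src/cdp/adapter.ts
// as described above; details of the real adapters live in src/cdp/*).
interface CdpAdapter {
  cdp(method: string, params?: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical stub: canned responses keyed by CDP method name.
function stubAdapter(responses: Record<string, unknown>): CdpAdapter {
  return {
    async cdp(method) {
      if (!(method in responses)) throw new Error(`unstubbed CDP method: ${method}`);
      return responses[method];
    },
  };
}

const adapter = stubAdapter({
  "Page.getNavigationHistory": { currentIndex: 1, entries: [] },
});
const history = await adapter.cdp("Page.getNavigationHistory");
console.log(history); // → { currentIndex: 1, entries: [] }
```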
```sh
bun run serve                    # default :3765, gesposito/chrome-diopside-onnx
PORT=4000 bun run serve
CDP_MODEL=user/my-model bun run serve
bun run serve --release v20      # pin ONNX + tokenizer SHAs from releases/v20.json
```

Loads the model + compressed tools catalog once at startup, then exposes:
| Endpoint | Purpose |
|---|---|
| `GET /health` | `{ok, model, revision, startedAt, toolsLoaded}` |
| `POST /infer` | Stateless: `{userText}` → `{prediction, raw, latencyMs}`. No grounding, no execution. |
| `WS /ws` | Stateful session (see below) |
WebSocket messages are JSON with a type field.
Client → server:
| Message | Meaning |
|---|---|
| `{type: "init", target, initialUrl?, kernelOptions?}` | Open a session. `target`: `"webview"` (default, server-owned), `"proxy"` (client forwards CDP), `"kernel"` (server mints a kernel.sh session), or any `ws://…` URL for a bring-your-own remote Chrome. |
| `{type: "user-text", text, msgId?, numTools?, maxTokens?}` | Infer → ground → execute against the session's adapter. |
| `{type: "transcribe", id, audioBase64, task?}` | Transcribe mono 16 kHz Float32 PCM (base64-encoded). `task: "translate"` outputs English regardless of input language. |
| `{type: "cdp-response", id, result? \| error?}` | Proxy-mode only: reply to a server `cdp-call`. |
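Preparing the audio payload for `transcribe` amounts to base64-encoding the raw Float32 sample buffer. A sketch (the exact encoding the server expects is an assumption here; check its decode path before relying on this):

```typescript
// Sketch: mono 16 kHz Float32 PCM ↔ base64, as carried by the transcribe message.
function pcmToBase64(samples: Float32Array): string {
  return Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength).toString("base64");
}

function base64ToPcm(b64: string): Float32Array {
  // Copy into a fresh buffer so the Float32Array view is 4-byte aligned
  // (Node's base64 decode may hand back an offset into a shared pool).
  const bytes = Uint8Array.from(Buffer.from(b64, "base64"));
  return new Float32Array(bytes.buffer);
}

const pcm = new Float32Array([0, 0.5, -0.5, 1]);
const msg = { type: "transcribe", id: "t1", audioBase64: pcmToBase64(pcm) };
console.log(base64ToPcm(msg.audioBase64)); // round-trips to the original samples
```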
Server → client:
| Message | Meaning |
|---|---|
| `{type: "ready", target, sessionId?, liveViewUrl?}` | Session ready. `liveViewUrl` populated for non-headless kernel.sh sessions. |
| `{type: "prediction", tool, arguments, latencyMs}` | Raw model output before grounding. |
| `{type: "grounding", rewritten, ...}` | Whether grounding fired + before/after args. |
| `{type: "execution", ok, result? \| error?}` | CDP call result. |
| `{type: "cdp-call", id, method, params}` | Proxy-mode: server asks the client to invoke a CDP method. |
| `{type: "transcription", id, text}` | Whisper output. |
| `{type: "error", where, message}` | Attach / transport / exec failure. |
`src/split-prompt.ts` runs before inference on the WS path. Explicit connectives (`then`, `, then`, `;`, sentence boundaries) always split; bare `and` only splits when every resulting piece starts with an imperative verb. Quoted substrings are masked first so `"type 'cats and dogs'"` never splits mid-string. One false positive ("click the OK and Cancel buttons" getting split) is worse than any number of false negatives, so the heuristic is deliberately conservative.
This module becomes dead code once the fine-tune learns to emit arrays of tool calls for compound prompts.
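The heuristic can be sketched like this — a simplified stand-in for `src/split-prompt.ts` (the real module uses POS tagging; this version gates bare `and` on a small hand-picked verb whitelist):

```typescript
// Sketch: mask quotes → split on explicit connectives → split on "and"
// only if every resulting piece starts with an imperative verb.
const IMPERATIVE_VERBS = new Set(["open", "go", "click", "type", "scroll", "navigate", "close", "search"]);

function splitPrompt(text: string): string[] {
  // Mask quoted substrings so "cats and dogs" never splits mid-string.
  const quotes: string[] = [];
  const masked = text.replace(/"[^"]*"|'[^']*'/g, (q) => `\u0000${quotes.push(q) - 1}\u0000`);
  // Explicit connectives always split.
  let parts = masked.split(/\s*(?:,?\s*then\b|;)\s*/i);
  // Bare "and": tentative split, kept only if every piece is verb-led.
  const next = parts.flatMap((p) => p.split(/\s+and\s+/i));
  if (next.every((p) => IMPERATIVE_VERBS.has(p.trim().split(/\s+/)[0]?.toLowerCase() ?? ""))) {
    parts = next;
  }
  // Unmask and tidy.
  return parts
    .map((p) => p.replace(/\u0000(\d+)\u0000/g, (_, i) => quotes[Number(i)]).trim())
    .filter(Boolean);
}

console.log(splitPrompt("open a new tab, then go to github"));
// → ["open a new tab", "go to github"]
console.log(splitPrompt("type 'cats and dogs'"));
// → ["type 'cats and dogs'"]
```

Note how "click the OK and Cancel buttons" survives unsplit: "Cancel buttons" doesn't start with a whitelisted verb, so the tentative `and` split is discarded.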
The side panel includes a microphone button that records via @ricky0123/vad-web (Silero VAD) and streams the PCM buffer to the server's transcribe endpoint, which runs Whisper-base (multilingual) through @huggingface/transformers + onnxruntime-node. Transcribed text is fed straight into the normal prompt pipeline.
- VAD assets (`silero_vad_v5.onnx`, `vad.worklet.bundle.min.js`, `ort-wasm-simd-threaded.{mjs,wasm}`) are copied into the extension build via `vite-plugin-static-copy` (flat layout, since MicVAD fetches them by fixed filename relative to `baseAssetPath`).
- VAD runtime + ORT bundle (~26 MB) is code-split via dynamic `import()` on first mic tap — users who never record pay zero bundle cost.
- `task="translate"` flag on the transcribe message switches Whisper to source → English.
- A compact caption overlay is painted in the active tab via `chrome.scripting.executeScript` for feedback while speaking.
- Provider is pluggable via `TRANSCRIBE_PROVIDER` env; see `src/transcribe/provider.ts`.
extension/ is a thin bridge — side-panel UI is React + shadcn/ui + Tailwind v4; background worker is a chrome.debugger ↔ WebSocket forwarder. All inference, grounding, and execution happen server-side.
```
┌───────────────────────────────┐              ┌──────────────────────┐
│ Chrome extension (MV3)        │              │ Bun server :3765     │
│                               │              │                      │
│ entrypoints/sidepanel/        │   runtime.   │ WsSession            │
│  App.tsx ── user prompt ──────┼── Port ─────►│ ├─ proxyAdapter      │
│                               │              │ ├─ inference         │
│ entrypoints/background.ts     │              │ ├─ splitIntoSteps    │
│  ├─ browser.debugger.attach   │  WebSocket   │ ├─ groundToolCall    │
│  ├─ WS client ────────────────┼──────────────┤ └─ adapter.cdp(…) ───┼─┐
│  └─ runs CDP calls the        │              │                      │ │
│     server proxies back ◄─────┼──────────────┼──── cdp-call ────────┼─┘
└───────────────────────────────┘              └──────────────────────┘
```
One-time setup:
```sh
cd extension
bun install
bun run dev        # rebuild on change → extension/.output/chrome-mv3/
```

Then in Chrome:

- `chrome://extensions` → Developer mode ON
- Load unpacked → select `extension/.output/chrome-mv3/`
- Start the server in a separate terminal: `bun run serve`
- Click the extension action → side panel opens → type or speak a prompt
When a side-panel session is active, Chrome shows Chrome Diopside started debugging this browser (any use of chrome.debugger triggers it — not suppressible).
Backends. The side panel exposes a dropdown selector: "local" (default, attaches to the user's active tab via chrome.debugger) and "kernel" (server spins up a kernel.sh cloud browser; live-view URL is opened in a new tab). Kernel sessions auto-delete on WS close or after timeoutSeconds (default 300).
Files:
| File | Role |
|---|---|
| `extension/wxt.config.ts` | Manifest + permissions (debugger, tabs, storage, scripting); static-copy plugin for VAD assets |
| `extension/entrypoints/background.ts` | Service worker: per-port `Session` owns one `browser.debugger` attachment + one WS. On server `cdp-call`, invokes `browser.debugger.sendCommand` and replies with `cdp-response`. Auto-attaches on tab create; reattaches on tab switch. |
| `extension/entrypoints/sidepanel/App.tsx` | React side panel: long-lived `browser.runtime.Port` to background, backend selector, voice toggle, streamed response log |
| `extension/lib/voice-mode.ts` | VAD + transcription orchestrator (dynamically imported) |
| `extension/lib/ws-transport.ts` | AI-SDK transport wrapping the WS |
| `extension/lib/panel-connection.ts` | Port lifecycle + reconnection |
"v22" isn't a single artifact — it's a bundle of dataset + base model + LoRA + ONNX + training code + eval metrics. bun run release captures the commit SHAs of each piece (via the HF API), the W&B run ID, the current git HEAD, and the eval metrics into releases/<name>.json. The file, once committed and tagged, documents the bundle.
```sh
bun run release v22 h0qiilug --notes "Browser.setWindowBounds tightened"

# From the pipeline Ship stage (runId auto-filled from the selected run;
# name auto-generated) → produces releases/r-YYYYMMDD-xxxxxx.json which
# you can rename before tagging.

git add releases/v22.json && git commit -m "Release v22"
git tag v22 && git push --tags
```

Captured fields: `dataset`, `base_model`, `lora`, `onnx` (each with repo + sha); `wandb.run_id` + URL; `git_sha`; `eval_metrics` (tool/params/combined/FP) from `outputs/eval/exec-report.json` at freeze time.
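Reconstructed from the captured fields above, a manifest might look like this in TypeScript terms — the exact key names in `releases/<name>.json` are an assumption, not copied from the repo:

```typescript
// Hypothetical shape of a release manifest (field list from the text above;
// key names are assumptions).
type Pinned = { repo: string; sha: string };

interface ReleaseManifest {
  name: string;
  dataset: Pinned;
  base_model: Pinned;
  lora: Pinned;
  onnx: Pinned;
  wandb: { run_id: string; url: string };
  git_sha: string;
  eval_metrics: { tool: number; params: number; combined: number; false_positives: number };
  notes?: string;
}

// A guard like this is enough to fail fast on a hand-edited manifest.
function isPinned(x: unknown): x is Pinned {
  return (
    typeof x === "object" && x !== null &&
    typeof (x as Pinned).repo === "string" &&
    typeof (x as Pinned).sha === "string"
  );
}

console.log(isPinned({ repo: "gesposito/chrome-diopside-onnx", sha: "abc123" })); // → true
```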
Partially wired. infer.ts, try.ts, and serve all take --release <name> and pin the ONNX + tokenizer SHAs via loadModels({ revision, baseTokenizerRevision }). eval-exec.ts doesn't yet, and main.py's snapshot_download doesn't take a revision= arg — so remote eval still pulls HEAD regardless of release. Until those land, a release file documents the bundle but doesn't force full reproducibility.
```
main.py                     # Cerebrium endpoints: train, eval inference, push_lora, convert
run.py                      # W&B sweep agent (spawned by sweep.ts)
cerebrium.toml              # Cerebrium deployment config
pyproject.toml / uv.lock    # Python deps (source of truth)
requirements.txt            # Auto-generated from uv.lock for Cerebrium

src/
  browser.ts                # Bun.WebView wrapper + executeCdp with timeout
  queue.ts                  # Sequential CDP command queue with retry policy
  inference.ts              # Shared inference module (used by infer.ts, server, try.ts)
  grounding.ts              # AX-tree element resolution (8 CDP tools, 4 modes)
  split-prompt.ts           # Compound-prompt splitter (quoted-text masking + POS gate)
  releases.ts               # Release manifest loader
  server.ts                 # Bun inference server: HTTP /infer + WS /ws
  kernel.ts                 # kernel.sh REST client (cloud Browsers-as-a-Service)
  whisper.ts                # Transcription façade — picks a provider at runtime
  transcribe/
    provider.ts             # TranscriptionProvider interface
    transformers-js.ts      # Whisper via @huggingface/transformers
  cdp/
    adapter.ts              # CdpAdapter interface
    webview.ts              # Local Bun.WebView adapter
    remote.ts               # Remote Chrome over ws://host:9222/devtools/page/<id>
    proxy.ts                # Server-side adapter forwarding CDP to a client (extension)
  grounding.test.ts         # 83 unit tests covering matcher internals
  split-prompt.test.ts      # Splitter unit tests

scripts/
  generate-dataset.ts       # Synthetic dataset generation via Claude API
  list-cdp-commands.ts      # Extract unique CDP commands used
  build-tools.ts            # Build tool definitions from the CDP protocol spec
  push-dataset.ts           # Upload dataset/formatted/ to HuggingFace Hub
  train.ts                  # Trigger a single training run on Cerebrium
  sweep.ts                  # Launch hyperparameter sweep (spawns run.py)
  watch.ts                  # Poll W&B until a run terminates
  push-lora.ts              # Push trained LoRA from Cerebrium to HF Hub
  convert.ts                # Merge LoRA + export ONNX (on Cerebrium)
  eval.ts                   # String-match eval orchestrator (trigger, fetch, judge)
  eval-exec.ts              # Execution-based eval against real Chrome
  eval_inference.py         # Model forward pass (local eval only)
  analyze-eval.ts           # Confusion pairs + FP breakdown + baseline diff
  grounding-bench.ts        # Grounding benchmark on static HTML fixtures
  infer.ts                  # Local inference CLI (pure Bun, no Python)
  try.ts                    # End-to-end: infer + ground + execute on a URL
  observe.ts                # AX-tree candidate listing for a query (no model)
  release.ts                # Freeze a release manifest
  pipeline.tsx              # Ink TUI — pipeline control panel
  pipeline/
    wandb.ts                # W&B GraphQL client + helpers
    steps.ts                # Pipeline stage/step definitions
    components.tsx          # Shared UI components
  clean.ts                  # Remove generated artifacts
  smoke-kv-cache.ts         # KV cache smoke test

extension/                  # Chrome MV3 extension (WXT + React + Tailwind v4)
  wxt.config.ts             # Manifest + permissions + static-copy for VAD assets
  entrypoints/
    background.ts           # Service worker: chrome.debugger ↔ WS bridge
    sidepanel/              # React side panel UI
  lib/
    panel-connection.ts     # Port lifecycle + reconnection
    ws-transport.ts         # AI-SDK WS transport
    voice.ts / voice-mode.ts  # Voice capture orchestrator

dataset/
  formatted/                # Training data (also pushed to HF Hub)
  grounding-cases.json      # Grounding benchmark cases
  grounding-fixtures/       # HTML fixtures for the grounding benchmark

releases/                   # Frozen release manifests (one JSON per bundle)

outputs/                    # gitignored
  eval/                     # eval reports + cached predictions
  grounding/                # grounding benchmark reports
  onnx/                     # ONNX model (after conversion)
  hf-cache/                 # cached HF models for local inference

vendor/
  devtools-protocol/        # submodule — CDP protocol JSON spec
  unslothai-notebooks/      # submodule — fine-tuning notebook references
  cerebriumai-examples/     # submodule — deployment examples
```
- Single-step only. The model emits one tool call per inference. `src/split-prompt.ts` handles compound prompts as a pre-model pass but won't infer implicit sequencing ("search for github" = focus + type + Enter). A multi-step fine-tune would require populating `dataset/formatted/multi-step.json`.
- No iframe / shadow-DOM grounding. Cookie consent banners, Stripe checkouts, embedded widgets are invisible to the grounding layer today. Plan in `src/grounding.ts` under "FUTURE".
- Sweep is parallel random search, not Bayesian. See the Sweep section above.
- Release pinning is partial. `serve`, `infer`, `try` honour `--release`; `eval-exec` and remote training don't.
- 281-example test set. A 95% CI around a 0.95 proportion is roughly ±2.5%. Sub-1-point differences between runs are inside the noise floor.