perf(visual): skip re-OCR of unchanged pixels, right-size the OCR input, pin the clipboard preface #685

Merged: FuJacob merged 3 commits into main from perf/visual-context-efficiency on Jun 12, 2026

FuJacob commented on Jun 12, 2026

Summary

Three efficiency cuts in the visual-context pipeline, sized from the energy-audit follow-up (the OCR lane was the one remaining tier-B lever):

  • Pixel-hash extraction cache. Refocusing a window re-ran the full Vision pass even when the captured pixels were identical (alt-tab away and back is the common case). A small bounded cache keyed by an FNV-1a stride hash of the capture now reuses the raw extraction; hygiene, normalization, and the field-text stripping still rerun against the live field text, so a hit stays byte-identical to re-OCRing the same pixels.
  • Downscale cap 1600 to 1200. The Retina capture of the 700pt strip arrives above both caps, so this only changes how much gets handed to Vision: ~44% fewer pixels per accurate-mode pass while typical 11-13pt UI text stays comfortably above the recognition floor (~1.2 px/pt on the strip). The recognition level itself intentionally stays .accurate: measured data puts .fast at only ~1.6x on Apple Silicon with a real recall cost.
  • Clipboard preface pinning. The clipboard relevance verdict was re-evaluated against the live prefix on every request. Because the clipboard section precedes the typed prefix in the prompt, every verdict flip rewrote the prompt head and collapsed the llama engine's reusable KV common prefix into a full re-prefill. An accepted verdict is now pinned per (field session, pasteboard change count). A nil verdict keeps re-evaluating because it adds nothing to the prompt (so the head stays stable) and the clipboard may become relevant only after more text is typed.
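
To make the first bullet concrete, here is a minimal sketch of a stride-sampled FNV-1a hash feeding a small bounded cache. It is not the PR's code: `strideHash`, `ExtractionCache`, the FIFO eviction policy, and the 4-bytes-per-pixel buffer layout are all assumptions for illustration. Only the FNV-1a constants and the stride of 17 (mentioned in the review summary) come from known sources.

```swift
/// 64-bit FNV-1a over every `step`-th byte of a raw pixel buffer.
/// Hypothetical helper for illustration; not the PR's actual function.
func strideHash(_ bytes: [UInt8], step: Int = 17) -> UInt64 {
    precondition(step > 0, "step must be positive")
    var hash: UInt64 = 0xcbf2_9ce4_8422_2325   // FNV-1a 64-bit offset basis
    let prime: UInt64 = 0x0000_0100_0000_01b3  // FNV-1a 64-bit prime
    var index = 0
    while index < bytes.count {
        hash ^= UInt64(bytes[index])           // FNV-1a: XOR the byte first...
        hash = hash &* prime                   // ...then multiply, wrapping on overflow
        index += step
    }
    return hash
}

/// Small bounded cache keyed by the pixel hash.
/// FIFO eviction is an assumed policy; the PR only says "small bounded cache".
struct ExtractionCache<Value> {
    private var entries: [(key: UInt64, value: Value)] = []
    let capacity: Int

    init(capacity: Int) {
        self.capacity = max(1, capacity)
    }

    func value(for key: UInt64) -> Value? {
        entries.first(where: { $0.key == key })?.value
    }

    mutating func store(_ value: Value, for key: UInt64) {
        entries.removeAll { $0.key == key }    // replace any existing entry
        entries.append((key: key, value: value))
        if entries.count > capacity {
            entries.removeFirst()              // evict the oldest entry
        }
    }
}

// Usage: identical pixels produce the same key, so the second lookup hits.
let capture = [UInt8](repeating: 200, count: 64 * 64 * 4)  // assumed 4 bytes per pixel
var extractionCache = ExtractionCache<[String]>(capacity: 4)
let captureKey = strideHash(capture)
if extractionCache.value(for: captureKey) == nil {
    extractionCache.store(["recognized line"], for: captureKey)
}
```

With 4-byte pixels, a step of 17 is coprime with the pixel size, so successive samples rotate through all four channel offsets instead of always reading the same channel. Only the raw extraction would be cached this way; per the description above, hygiene, normalization, and field-text stripping still rerun on every call.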

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/ScreenshotContextGeneratorTests \
  -only-testing:CotabbyTests/VisualContextStartCoalescerTests \
  -only-testing:CotabbyTests/SuggestionCoordinatorAcceptanceTests
# 0 failures (wall time inflated by an unrelated local disk-pressure incident during the run)

swiftlint lint --quiet
# exit 0

New test: test_generateContext_reusesExtractionForIdenticalPixels (counting extractor proves the Vision pass is skipped and the excerpt is identical).

Linked issues

Refs #661

Risk / rollout notes

  • A stride-hash collision would reuse OCR text for a window whose pixels changed; the stride still touches every row and any text change moves antialiased pixels broadly, so this is vanishingly unlikely, and the blast radius is one stale excerpt for one field session.
  • The 1200px cap is a measured-tradeoff default, not a hard floor; if small-text recall regresses in practice, reverting the constant is a one-line change.
  • Pinning changes when clipboard context can ENTER a session's prompts (it always could before via flips); it cannot change what the relevance filter accepts. A new copy re-evaluates immediately.
  • Follow-up candidates deliberately not in this PR: battery-aware capture policy via the existing power-profile machinery, and raising minimumTextHeight with measurement.
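
The ~44% figure behind the 1200px cap can be checked with a short sketch. It assumes the cap applies to the longest side and that downscaling preserves aspect ratio; `downscaledSize` and the 2800x1400 capture size are hypothetical, chosen only because that size exceeds both caps.

```swift
/// Fits a capture inside `maxDimension` on its longest side, preserving aspect ratio.
/// Hypothetical helper; the real pipeline's resizing code may differ.
func downscaledSize(width: Int, height: Int, maxDimension: Int) -> (width: Int, height: Int) {
    let longest = max(width, height)
    guard longest > maxDimension else { return (width, height) }  // already within the cap
    let scale = Double(maxDimension) / Double(longest)
    return (Int((Double(width) * scale).rounded()),
            Int((Double(height) * scale).rounded()))
}

// An illustrative capture that is larger than both the old and the new cap.
let oldSize = downscaledSize(width: 2800, height: 1400, maxDimension: 1600)
let newSize = downscaledSize(width: 2800, height: 1400, maxDimension: 1200)

// Pixel count scales with the square of the linear scale: (1200 / 1600)^2 = 0.5625.
let pixelReduction = 1.0 - Double(newSize.width * newSize.height)
                         / Double(oldSize.width * oldSize.height)
```

Here `pixelReduction` comes out to 0.4375, which matches the quoted ~44%. The saving applies only when the capture exceeds both caps, as the summary states for the Retina strip.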

🤖 Generated with Claude Code

Greptile Summary

Three targeted efficiency improvements to the visual-context pipeline: a bounded FNV-1a pixel-hash cache that skips re-OCR when captured pixels are unchanged, a downscale cap reduction from 1600 → 1200 px (≈44% fewer pixels per Vision pass), and clipboard-relevance pinning that stabilises the prompt head so the engine's KV common prefix survives across keystrokes.

  • Pixel-hash cache (ScreenshotContextGenerator): stride-17 FNV-1a hash gates the Vision pass; finishedExcerpt is refactored into a shared helper so hygiene and field-text stripping rerun on every call regardless of cache hit. The noRecognizedText/windowTitle fallback path does not populate the cache (noted in a previous review thread).
  • 1200 px OCR cap (VisualContextModels): measured-tradeoff constant with inline rationale; one-line revert if small-text recall regresses.
  • Clipboard preface pinning (SuggestionCoordinator+Prediction): non-nil verdicts are pinned per (focusChangeSequence, changeCount); nil verdicts keep re-evaluating. handleSuggestionSettingsChange does not clear the memo, so a pinned verdict from one engine can survive an engine switch within the same field session.
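
The pinning rule in the last bullet can be sketched as follows. This is a simplified illustration, not the code in SuggestionCoordinator+Prediction: `PinnedClipboardMemo`, `ClipboardPinner`, and the closure-based `evaluate` parameter are hypothetical stand-ins, and the memo here stores only accepted (non-nil) verdicts.

```swift
/// Hypothetical memo: an accepted clipboard verdict plus the key it was accepted under.
struct PinnedClipboardMemo {
    let focusChangeSequence: Int   // identifies the field session
    let changeCount: Int           // pasteboard change count at acceptance time
    let value: String              // the accepted clipboard context
}

struct ClipboardPinner {
    private(set) var memo: PinnedClipboardMemo?

    /// Returns the pinned context when the key matches; otherwise re-evaluates.
    /// A nil verdict is never stored, so it is re-evaluated on the next request.
    mutating func context(focusChangeSequence: Int,
                          changeCount: Int,
                          evaluate: () -> String?) -> String? {
        if let memo = memo,
           memo.focusChangeSequence == focusChangeSequence,
           memo.changeCount == changeCount {
            return memo.value                      // pinned: the prompt head stays stable
        }
        guard let verdict = evaluate() else {
            return nil                             // nothing added to the prompt, nothing pinned
        }
        memo = PinnedClipboardMemo(focusChangeSequence: focusChangeSequence,
                                   changeCount: changeCount,
                                   value: verdict)
        return verdict
    }
}

// Usage: once accepted, the verdict survives later "irrelevant" evaluations
// until the focus sequence or the pasteboard change count moves.
var pinner = ClipboardPinner()
var evaluations = 0
let first = pinner.context(focusChangeSequence: 1, changeCount: 5) {
    evaluations += 1
    return nil
}
let second = pinner.context(focusChangeSequence: 1, changeCount: 5) {
    evaluations += 1
    return "copied text"
}
let third = pinner.context(focusChangeSequence: 1, changeCount: 5) {
    evaluations += 1
    return nil
}
```

In this sketch nothing besides a key change invalidates the memo, which mirrors the gap noted above: a settings or engine change within the same field session would leave the pinned verdict in place.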

Confidence Score: 5/5

Safe to merge; all three optimizations are narrowly scoped, each with a documented worst-case blast radius of one stale value for one field session.

Changes are well-contained: the pixel-hash cache degrades gracefully (nil hash disables it, stride-17 was corrected in the follow-up commit), the 1200 px cap is a single constant, and the clipboard memo self-heals on focus change or clipboard change. The only gap is that the memo is not invalidated on engine/settings change, which could leave a stale relevance verdict for the remainder of a field session — a minor behavioural edge case with no data-loss or security implications.

Cotabby/App/Coordinators/SuggestionCoordinator+Lifecycle.swift — handleSuggestionSettingsChange does not clear clipboardPrefaceMemo.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Services/Visual/ScreenshotContextGenerator.swift | Adds FNV-1a pixel-hash extraction cache (stride 17 to sample all channels) and refactors finishedExcerpt into a shared helper; noRecognizedText/windowTitle path skips storeExtraction so that branch never warms the cache. |
| Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift | Extracts clipboard resolution into pinnedClipboardContext with correct (focusSequence, changeCount) keying; nil verdicts keep re-evaluating, non-nil are pinned. Memo is not cleared on settings change. |
| Cotabby/App/Coordinators/SuggestionCoordinator.swift | Adds ClipboardPrefaceMemo struct and clipboardPrefaceMemo property; straightforward data model change with clear documentation. |
| Cotabby/Models/VisualContextModels.swift | Lowers maxImageDimension default from 1600 to 1200; well-documented tradeoff with measured rationale in comments. |
| CotabbyTests/ScreenshotContextGeneratorTests.swift | Adds test_generateContext_reusesExtractionForIdenticalPixels with a CountingTextExtractor stub; correctly validates Vision-pass is skipped and output is byte-identical on cache hit. |
| CotabbyTests/PermissionAndContextModelTests.swift | Updates maxImageDimension expectation to 1200 with explanatory comment; trivial alignment change. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[generateContext called] --> B[captureScreenshot]
    B --> C[pixelHash of image]
    C --> D{hash in cache?}
    D -- yes --> E[finishedExcerpt\nhygiene + field-text strip]
    D -- no --> F[textExtractor.extractText\nVision OCR pass]
    F --> G{noRecognizedText?}
    G -- yes, windowTitle exists --> H[return title excerpt\ncache NOT populated]
    G -- no --> I[storeExtraction in cache]
    I --> E
    E --> J[VisualContextExcerpt]

    K[pinnedClipboardContext] --> L{settings enabled?}
    L -- no --> M[return nil]
    L -- yes --> N{memo hit?\nfocusSeq + changeCount match\nvalue != nil}
    N -- yes --> O[return pinned value]
    N -- no --> P[truncatedPromptPrefix\nclipboardRelevanceFilter]
    P --> Q[store ClipboardPrefaceMemo]
    Q --> R[return value]

Comments Outside Diff (1)

  1. Cotabby/Services/Visual/ScreenshotContextGenerator.swift, lines 79-98

    P2 noRecognizedText branch never populates the cache

    When Vision finds no text but a valid windowTitle is present, the function returns early at line 98 without calling storeExtraction. The pixelHash was computed, but the cache is never populated for this path. On the next call with identical pixels — the exact alt-tab scenario the PR targets — the cache check misses and Vision runs again. Applications like Figma or image-heavy UIs that commonly land in this branch get no benefit from the pixel-hash cache.

    A lightweight fix would be to store a sentinel ExtractedScreenText (e.g., empty lines) before the early return so the key is recorded; alternatively, the windowTitle path could be cached separately.
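
One way to read the sentinel suggestion above, as a rough sketch only: cache the outcome of the Vision pass, including "no text", so identical pixels skip the pass on either branch. `CachedExtraction`, `extractWithCache`, and the plain dictionary cache are hypothetical; the real ExtractedScreenText type and the bounded cache are not shown in this thread.

```swift
/// Hypothetical cached outcome: recognized lines or an explicit "no text" sentinel.
enum CachedExtraction: Equatable {
    case text([String])
    case noRecognizedText
}

/// Runs `runVision` only on a cache miss. A nil hash disables caching entirely.
/// The dictionary is unbounded here for brevity; a real cache would cap its size.
func extractWithCache(hash: UInt64?,
                      cache: inout [UInt64: CachedExtraction],
                      runVision: () -> [String]) -> CachedExtraction {
    if let hash = hash, let hit = cache[hash] {
        return hit                                  // hit, including the sentinel case
    }
    let lines = runVision()
    let outcome: CachedExtraction = lines.isEmpty ? .noRecognizedText : .text(lines)
    if let hash = hash {
        cache[hash] = outcome                       // record the key even when no text was found
    }
    return outcome
}

// Usage: an image-only window is OCRed once, then served from the cache.
var outcomeCache: [UInt64: CachedExtraction] = [:]
var visionPasses = 0
let firstPass = extractWithCache(hash: 42, cache: &outcomeCache) {
    visionPasses += 1
    return []
}
let secondPass = extractWithCache(hash: 42, cache: &outcomeCache) {
    visionPasses += 1
    return []
}
```

The caller would still build the windowTitle fallback excerpt on a `.noRecognizedText` hit, so only the Vision pass is skipped.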

Reviews (3): Last reviewed commit: "review: stride the pixel hash coprime wi..."

…ut, pin the clipboard preface

Three visual-context efficiency cuts. Refocusing a window re-ran the
full Vision pass even when the captured pixels were identical; a small
pixel-hash cache now reuses the raw extraction while hygiene and
bounding still rerun against the live field text, so a hit stays
byte-identical to re-OCRing the same pixels. The pre-OCR downscale cap
drops from 1600 to 1200 (the Retina capture of the 700pt strip exceeds
both caps, and 1200 keeps UI text well above Vision's recognition floor
while cutting the Vision workload ~44%). And the clipboard relevance
verdict, which was re-evaluated against the live prefix on every
request, is now pinned per field session once accepted: the clipboard
section precedes the typed prefix in the prompt, so each verdict flip
rewrote the prompt head and collapsed the engine's reusable KV common
prefix into a full re-prefill.
Comment thread Cotabby/Services/Visual/ScreenshotContextGenerator.swift
FuJacob merged commit 802a291 into main on Jun 12, 2026
4 checks passed