Improve OCR text extraction quality (dark UIs, resolution, extraction logic)
## Problem
The current OCR pipeline misses most text on dark-themed applications (WhatsApp, Slack, Discord, etc.). On a WhatsApp conversation with dozens of visible messages, agent-watch only captured ~80 characters (menu bar text: "File Edit Chat Call View Window Help").
## Root causes identified

- **`NativeTextExtractor` short-circuits OCR:** The accessibility extractor runs first. If it returns ≥ `minimumAccessibilityChars` (even just menu bar items), OCR is skipped entirely. For WhatsApp, accessibility returns ~100 chars of sidebar/menu text, satisfying the minimum — so the Vision framework OCR never runs on the actual message content.
- **Frame buffer resolution too low:** `FrameBufferStore` downscales captures to `maxDimension = 1280`, which halves Retina resolution (2560 → 1280). Text becomes too small for reliable OCR, especially in dense UIs.
- **Apple Vision framework struggles with dark themes:** `VNRecognizeTextRequest` performs poorly on light-on-dark text; the Vision framework was designed primarily for document scanning (dark text on light backgrounds).
## Changes

### 1. `NativeTextExtractor.swift` — always run both extractors, keep the best

Before: accessibility runs first; if it returns enough characters, OCR is skipped.
After: both accessibility and OCR always run; the result with more text wins.
```swift
// Before
if let accessibilityText = accessibilityExtractor.extractText(),
   accessibilityText.count >= minimumAccessibilityChars {
    return ExtractedText(text: accessibilityText, source: .accessibility, metadata: metadata)
}
// OCR only runs as a fallback
```

```swift
// After
let accessibilityText = accessibilityExtractor.extractText()
var ocrText: String? = nil
if ocrEnabled {
    ocrText = try ocrExtractor.extractText()
}
// Return whichever extractor produced more text
let accLen = accessibilityText?.count ?? 0
let ocrLen = ocrText?.count ?? 0
if let ocrText, ocrLen > accLen {
    return ExtractedText(text: ocrText, source: .ocr, metadata: metadata)
}
return ExtractedText(text: accessibilityText ?? "", source: .accessibility, metadata: metadata)
```
### 2. `FrameBufferStore.swift` — increase resolution to full Retina

```swift
// Before
maxDimension: Int = 1280
// After
maxDimension: Int = 2560
```
Disk impact: frames grow from ~250 KB to ~400-800 KB each. With the existing retention/pruning policy this stays well under control.
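To make the resolution change concrete, here is a small sketch of the downscaling math a `maxDimension` cap implies (the function name `scaledSize` is illustrative, not the project's actual helper): the longest side is clamped and the aspect ratio preserved.

```swift
import Foundation

// Hypothetical sketch of maxDimension-based downscaling:
// clamp the longest side, preserve aspect ratio.
func scaledSize(width: Int, height: Int, maxDimension: Int) -> (width: Int, height: Int) {
    let longest = max(width, height)
    guard longest > maxDimension else { return (width, height) }
    let scale = Double(maxDimension) / Double(longest)
    return (Int((Double(width) * scale).rounded()),
            Int((Double(height) * scale).rounded()))
}

// Old cap halves a Retina capture; the new cap keeps it intact.
let oldSize = scaledSize(width: 2560, height: 1662, maxDimension: 1280)  // (1280, 831)
let newSize = scaledSize(width: 2560, height: 1662, maxDimension: 2560)  // (2560, 1662)
```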
### 3. `OCRTextExtractor.swift` — color inversion for dark themes

Runs OCR twice: once on the original image and once on a color-inverted copy (via Core Image's `CIColorInvert` filter), keeping whichever result contains more text. Also lowers `minimumTextHeight` from 0.005 to 0.002 to catch smaller text.
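A minimal sketch of the dual-pass approach, assuming a `CGImage` input (function names here are illustrative, not the exact project code): run Vision OCR on the original frame and on a `CIColorInvert`-ed copy, then keep the longer result.

```swift
import CoreImage
import Vision

// Run Vision OCR on a single image and join the recognized lines.
func recognizeText(in cgImage: CGImage) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.minimumTextHeight = 0.002  // lowered threshold, per the change above
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
    let lines = (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
    return lines.joined(separator: "\n")
}

// Produce a color-inverted copy so light-on-dark text becomes dark-on-light.
func invert(_ cgImage: CGImage, context: CIContext = CIContext()) -> CGImage? {
    guard let filter = CIFilter(name: "CIColorInvert") else { return nil }
    filter.setValue(CIImage(cgImage: cgImage), forKey: kCIInputImageKey)
    guard let output = filter.outputImage else { return nil }
    return context.createCGImage(output, from: output.extent)
}

// OCR both variants and keep whichever pass found more text.
func bestOCRResult(for cgImage: CGImage) throws -> String {
    let original = try recognizeText(in: cgImage)
    guard let invertedImage = invert(cgImage) else { return original }
    let inverted = try recognizeText(in: invertedImage)
    // Dark-theme UIs typically yield far more text from the inverted pass.
    return inverted.count > original.count ? inverted : original
}
```

Running two full OCR passes roughly doubles per-frame cost, which is the trade-off accepted here in exchange for dark-theme coverage.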
## Results

| Metric | Before | After |
| --- | --- | --- |
| WhatsApp text captured | ~80 chars (menu bar only) | 1,567 chars (all messages, contacts, timestamps, links) |
| Frame resolution | 1280×831 | 2560×1662 |
| `text_source` for WhatsApp | `accessibility` (short-circuited) | `ocr` (full Vision + inversion) |
## Environment
- macOS 26 (Tahoe)
- MacBook Pro M-series (Retina display)
- WhatsApp desktop, Slack, Discord (dark theme)
## Related