Vision-based desktop control for AI agents — interact with any application through visual understanding, not DOM inspection.
Unlike browser automation tools (Playwright, Puppeteer, Chrome MCP), Vision Control:
- Works with ANY application — desktop apps, games, legacy software, not just web browsers
- No DOM/API required — controls apps that lack programmatic interfaces
- Visual reasoning — AI sees what users see through grid-annotated screenshots
- True desktop automation — native OS-level mouse/keyboard control across all windows
Use Playwright/Chrome MCP when: You need to automate web browsers with DOM access, network interception, or headless testing.
Use Vision Control when: You need to automate desktop applications, legacy software, or any GUI that lacks APIs.
- Capture — Screenshot with 26×15 grid overlay (A-Z columns, 1-15 rows)
- Locate — AI identifies targets by grid coordinates (e.g. S3, M8/2)
- Act — Execute mouse clicks, drags, scrolls, and keyboard input at precise locations
```bash
# Capture screen with grid
python scripts/capture_grid.py

# Click at grid coordinate
python scripts/mouse_action.py click S3

# Type text
python scripts/keyboard_input.py type "Hello World"
```

The screen is divided into a 26×15 grid (A-Z columns × 1-15 rows). Coordinates like S3 target cell centers; S3/2 targets quadrant 2 (top-right).
Quadrants:

    ┌───┬───┐
    │ 1 │ 2 │
    ├───┼───┤
    │ 3 │ 4 │
    └───┴───┘
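The coordinate scheme above can be sketched as a small conversion helper. This is a hypothetical illustration, not code from the repo: the function name `grid_to_pixel` and the quarter-point quadrant offsets are assumptions about how cell centers and quadrants map to pixels.

```python
# Hypothetical helper (not part of the repo): convert a grid coordinate
# such as "S3" or "S3/2" into a pixel (x, y) for a given screen size.
import re

COLS, ROWS = 26, 15  # A-Z columns x 1-15 rows

# Assumed quadrant targets, as fractions of cell width/height:
# 1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right.
QUADRANT_OFFSETS = {1: (0.25, 0.25), 2: (0.75, 0.25), 3: (0.25, 0.75), 4: (0.75, 0.75)}

def grid_to_pixel(coord: str, screen_w: int, screen_h: int) -> tuple[int, int]:
    m = re.fullmatch(r"([A-Z])(\d{1,2})(?:/([1-4]))?", coord.strip().upper())
    if not m:
        raise ValueError(f"bad coordinate: {coord!r}")
    col = ord(m.group(1)) - ord("A")  # 0..25 for A..Z
    row = int(m.group(2)) - 1         # 0..14 for 1..15
    if not (0 <= col < COLS and 0 <= row < ROWS):
        raise ValueError(f"out of range: {coord!r}")
    cell_w, cell_h = screen_w / COLS, screen_h / ROWS
    # No quadrant suffix -> cell center; otherwise the quadrant's center.
    fx, fy = QUADRANT_OFFSETS[int(m.group(3))] if m.group(3) else (0.5, 0.5)
    return int((col + fx) * cell_w), int((row + fy) * cell_h)

print(grid_to_pixel("S3", 2560, 1440))    # → (1821, 240), the S3 cell center
print(grid_to_pixel("S3/2", 2560, 1440))  # → (1846, 216), top-right quadrant
```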
Mouse: click, right, double, move, drag, scroll
Keyboard: type, press, hotkey, click-type
See references/ for detailed command syntax.
MIT License — see LICENSE file.