Vision-based desktop control for AI agents — interact with any application through visual understanding, not DOM inspection.
Unlike browser automation tools (Playwright, Puppeteer, Chrome MCP), Vision Control:
- Works with ANY application — desktop apps, games, legacy software, not just web browsers
- No DOM/API required — controls apps that lack programmatic interfaces
- Visual reasoning — AI sees what users see through grid-annotated screenshots
- True desktop automation — native OS-level mouse/keyboard control across all windows
Use Playwright/Chrome MCP when: You need to automate web browsers with DOM access, network interception, or headless testing.
Use Vision Control when: You need to automate desktop applications, legacy software, or any GUI that lacks APIs.
- Capture — Screenshot with 26×15 grid overlay (A-Z columns, 1-15 rows)
- Locate — AI identifies targets by grid coordinates (e.g. S3, M8/2)
- Act — Execute mouse clicks, drags, scrolls, and keyboard input at precise locations
```bash
# Capture screen with grid
python scripts/capture_grid.py

# Click at grid coordinate
python scripts/mouse_action.py click S3

# Type text
python scripts/keyboard_input.py type "Hello World"
```

The screen is divided into a 26×15 grid (A-Z columns × 1-15 rows). Coordinates like S3 target cell centers; S3/2 targets quadrant 2 (top-right).
Quadrants:

    ┌───┬───┐
    │ 1 │ 2 │
    ├───┼───┤
    │ 3 │ 4 │
    └───┴───┘
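The coordinate scheme above can be sketched as a small conversion helper. This is a hypothetical illustration, not code from the repo: the function name `grid_to_pixel` and the quarter-point quadrant offsets are assumptions about how cell centers and quadrants map to pixels.

```python
# Hypothetical helper (not part of the repo): convert a grid coordinate
# such as "S3" or "S3/2" into a pixel (x, y) for a given screen size.
import re

COLS, ROWS = 26, 15  # A-Z columns x 1-15 rows

# Assumed quadrant targets, as fractions of cell width/height:
# 1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right.
QUADRANT_OFFSETS = {1: (0.25, 0.25), 2: (0.75, 0.25), 3: (0.25, 0.75), 4: (0.75, 0.75)}

def grid_to_pixel(coord: str, screen_w: int, screen_h: int) -> tuple[int, int]:
    m = re.fullmatch(r"([A-Z])(\d{1,2})(?:/([1-4]))?", coord.strip().upper())
    if not m:
        raise ValueError(f"bad coordinate: {coord!r}")
    col = ord(m.group(1)) - ord("A")  # 0..25 for A..Z
    row = int(m.group(2)) - 1         # 0..14 for 1..15
    if not (0 <= col < COLS and 0 <= row < ROWS):
        raise ValueError(f"out of range: {coord!r}")
    cell_w, cell_h = screen_w / COLS, screen_h / ROWS
    # No quadrant suffix -> cell center; otherwise the quadrant's center.
    fx, fy = QUADRANT_OFFSETS[int(m.group(3))] if m.group(3) else (0.5, 0.5)
    return int((col + fx) * cell_w), int((row + fy) * cell_h)

print(grid_to_pixel("S3", 2560, 1440))    # → (1821, 240), the S3 cell center
print(grid_to_pixel("S3/2", 2560, 1440))  # → (1846, 216), top-right quadrant
```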
Mouse: click, right, double, move, drag, scroll
Keyboard: type, press, hotkey, click-type
See references/ for detailed command syntax.
MIT License — see LICENSE file.