askiichan/vision-control-skill

Vision Control Skill

Python 3.8+ | License: MIT

Vision-based desktop control for AI agents — interact with any application through visual understanding, not DOM inspection.

Why Vision Control?

Unlike browser automation tools (Playwright, Puppeteer, Chrome MCP), Vision Control:

  • Works with ANY application — desktop apps, games, legacy software, not just web browsers
  • No DOM/API required — controls apps that lack programmatic interfaces
  • Visual reasoning — AI sees what users see through grid-annotated screenshots
  • True desktop automation — native OS-level mouse/keyboard control across all windows

Use Playwright/Chrome MCP when: You need to automate web browsers with DOM access, network interception, or headless testing.

Use Vision Control when: You need to automate desktop applications, legacy software, or any GUI that lacks APIs.

How It Works

  1. Capture — Screenshot with 26×15 grid overlay (A-Z columns, 1-15 rows)
  2. Locate — AI identifies targets by grid coordinates (S3, M8/2)
  3. Act — Execute mouse clicks, drags, scrolls, keyboard input at precise locations

For example:

# Capture screen with grid
python scripts/capture_grid.py

# Click at grid coordinate
python scripts/mouse_action.py click S3

# Type text
python scripts/keyboard_input.py type "Hello World"
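
An agent would typically chain these commands from code. The following is a minimal sketch of that idea, not part of the skill itself: the helper names (`capture_cmd`, `click_cmd`, `type_cmd`, `run`) are hypothetical, and only the three invocations shown above are assumed to exist. It defaults to a dry run so nothing is clicked or typed until you opt in.

```python
import subprocess
import sys
from typing import List


def capture_cmd() -> List[str]:
    """Command line for capturing the screen with the grid overlay."""
    return [sys.executable, "scripts/capture_grid.py"]


def click_cmd(coord: str) -> List[str]:
    """Command line for a left click at a grid coordinate such as 'S3'."""
    return [sys.executable, "scripts/mouse_action.py", "click", coord]


def type_cmd(text: str) -> List[str]:
    """Command line for typing a string of text."""
    return [sys.executable, "scripts/keyboard_input.py", "type", text]


def run(cmd: List[str], dry_run: bool = True) -> int:
    """Run a command, or only print it when dry_run is True (the default).

    Returns the process exit code, or 0 for a dry run.
    """
    if dry_run:
        print("would run:", " ".join(cmd[1:]))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Capture, click, then type. All three are dry runs here.
    for step in (capture_cmd(), click_cmd("S3"), type_cmd("Hello World")):
        run(step)
```

Passing `dry_run=False` executes the real scripts from the repository root, so review the printed commands first.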

Grid Coordinates

The screen is divided into a 26×15 grid (columns A-Z, rows 1-15). A coordinate like S3 targets the center of that cell; S3/2 targets quadrant 2 (top-right) of the same cell.

Quadrants:  ┌───┬───┐
            │ 1 │ 2 │
            ├───┼───┤
            │ 3 │ 4 │
            └───┴───┘
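
To make the mapping concrete, here is a rough sketch of how a coordinate string could be turned into a pixel position. It is an illustration under stated assumptions, not the skill's actual implementation: the function name `grid_to_pixel` is hypothetical, cells are assumed to be equal-sized, and a quadrant is assumed to resolve to the center of its quarter of the cell.

```python
import re
from typing import Tuple

COLS, ROWS = 26, 15  # grid size stated above: A-Z columns, 1-15 rows

# One column letter, a row number, and an optional "/quadrant" suffix.
_COORD = re.compile(r"^([A-Z])(\d{1,2})(?:/([1-4]))?$")


def grid_to_pixel(coord: str, width: int, height: int) -> Tuple[int, int]:
    """Map a coordinate like 'S3' or 'S3/2' to an (x, y) pixel position.

    Assumes equal-sized cells; quadrants are numbered 1-4, left to right
    and top to bottom, matching the diagram above.
    """
    match = _COORD.match(coord.strip().upper())
    if match is None:
        raise ValueError(f"invalid grid coordinate: {coord!r}")

    col = ord(match.group(1)) - ord("A")  # 0-based column index
    row = int(match.group(2)) - 1         # 0-based row index
    if not 0 <= row < ROWS:
        raise ValueError(f"row out of range 1-{ROWS}: {coord!r}")

    # Fractional offset inside the cell: the center by default,
    # or the center of the requested quadrant.
    frac_x, frac_y = 0.5, 0.5
    if match.group(3) is not None:
        quadrant = int(match.group(3)) - 1
        frac_x = 0.25 + 0.5 * (quadrant % 2)   # 1, 3 left; 2, 4 right
        frac_y = 0.25 + 0.5 * (quadrant // 2)  # 1, 2 top; 3, 4 bottom

    cell_w = width / COLS
    cell_h = height / ROWS
    return int((col + frac_x) * cell_w), int((row + frac_y) * cell_h)


if __name__ == "__main__":
    # On a 2600×1500 screen each cell is 100×100 pixels.
    print(grid_to_pixel("S3", 2600, 1500))    # (1850, 250)
    print(grid_to_pixel("S3/2", 2600, 1500))  # (1875, 225)
```

A helper like this is also a cheap way to sanity-check a coordinate before acting on it, since malformed input raises `ValueError` instead of clicking somewhere unexpected.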

Actions

Mouse: click, right, double, move, drag, scroll
Keyboard: type, press, hotkey, click-type

See references/ for detailed command syntax.

Safety

⚠️ Performs real system actions — verify coordinates before executing. Screenshots auto-delete for privacy.

License

MIT License — see LICENSE file.

About

An agent skill that lets AI see and control your screen: grid-based visual automation for LLM agents.
