Skip to content

rishabhsai/glance

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Glance

npm version PyPI version License: MIT

Structured screen understanding for AI apps. glance.run

Every AI screen companion sends a full 2MB screenshot to a vision model on every interaction. It's slow, expensive, and the AI has to guess where UI elements are.

Glance reads the macOS accessibility tree instead — giving your AI structured text with every element's role, label, value, and exact pixel coordinates. One function call. ~30ms. No screenshots.

Why

Screenshot Glance
Speed 2–5 seconds ~0.8 seconds
Tokens ~3,000 (image) ~500 (text)
Cost ~$0.03 / query ~$0.001 / query
Positions AI guesses from pixels Exact from OS
Upload 2MB image to cloud Nothing — local text

Text works great for ~45% of interactions (native apps, terminals, editors, Electron apps). For browsers, canvas apps, and games, combine with screenshots. Your AI doesn't need to see the screen most of the time — it just needs to know what's on it.

Install

# JavaScript / TypeScript
npm install glance-sdk

# Python
pip install glance-sdk

# Swift — add to Package.swift
.package(url: "https://github.com/rishabhsai/glance", from: "0.1.0")

Usage

import { screen, capture, find } from 'glance-sdk'

// LLM-ready text — drop directly into your prompt
const context = await screen()

// Full structured data
const state = await capture()
console.log(state.app)           // "Safari"
console.log(state.elementCount)  // 342
console.log(state.captureTimeMs) // 47.2

// Find element by name → exact pixel coordinates
const btn = await find('Submit')
console.log(btn.centerX, btn.centerY)  // 520, 340
from glance_sdk import screen, capture, find

context = screen()       # LLM-ready text
state = capture()        # full structured data
btn = find("Submit")     # exact position
import Glance

let context = try Glance.screen()         // LLM-ready text
let state = try Glance.capture()          // structured data
let btn = try Glance.find("Submit")       // exact position
print(btn?.center)                        // CGPoint(x: 520, y: 340)

What your LLM receives

Instead of a 2MB image, your AI gets structured text like this (~500 tokens):

[App: DaVinci Resolve 19.1 | Window: "Project 1 - Edit"]

## Focused
- [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]

## Controls
- [Button] "Cut" at (120,42)
- [Button] "Color" at (680,42)
- [PopUpButton] "Node" value="Corrector 1" at (820,42)

## Input Fields
- [Slider] "Lift" value=0.15 at (400,380)
- [Slider] "Gamma" value=-0.08 at (560,380)
- [Slider] "Gain" value=0.22 at (720,380)

The coordinates are exact — from the OS, not guessed by a vision model.

Smart fallback

Some apps don't expose accessibility data. Detect this and fall back:

const state = await capture()

if (state.elementCount > 5) {
  // structured text works — fast and cheap
  sendToLLM({ role: 'user', content: state.prompt })
} else {
  // canvas app — fall back to screenshot
  const img = await captureScreenshot()
  sendToLLM({ role: 'user', content: [{ type: 'image', data: img }] })
}

What it works with

Great for (~45%): VS Code, Cursor, Zed, Xcode, Ghostty, iTerm2, Warp, Terminal, Slack, Discord, Notion, Notes, Mail, Finder, System Settings, all Electron apps

Partial (~30%): Browsers (Chrome, Safari, Arc) — toolbar and tabs yes, web content varies by site. DaVinci Resolve, Adobe apps, Figma, Blender — menus yes, canvas no

Use screenshot (~25%): Games, Canvas/WebGL, deep web page content, remote desktop

CLI

The CLI outputs structured data to stdout — use from any language:

glance                     # LLM-ready text
glance screen --json       # full JSON
glance find "Submit"       # find element position
glance check               # verify accessibility permission

API

Four functions:

  • screen() → LLM-ready string with all elements and positions
  • capture() → full structured object (app, window, elements, metrics)
  • find(name) → element by label with exact pixel coordinates
  • checkAccess() → boolean, check if accessibility permission is granted

How it works

Glance reads the macOS accessibility tree (AXUIElement API) — the same structured data VoiceOver uses. Every app exposes its UI elements through this tree: buttons, text fields, sliders, links, with their labels, values, states, and positions.

The SDK reads this tree in ~30ms, formats it into LLM-optimized text, and returns it. No screenshots captured, no images uploaded, no vision model needed.

Permission: Requires macOS Accessibility permission (System Settings → Privacy & Security → Accessibility). Same permission Clicky and similar tools already need.

License

MIT

About

Structured screen understanding for AI apps. Replace screenshots with instant, accurate UI data — 10× faster, 30× cheaper.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors