Screenshot → vision model → xdotool action → repeat.
A minimal vision-action loop for GUI automation on Linux. Give it a goal in plain English; it takes a screenshot, asks a vision model what to click or type, executes the action, and repeats until the goal is met or it gives up.
- Takes a screenshot of your X display
- Sends it to a vision model (local Ollama or Claude via API)
- Executes the returned action via xdotool (click, type, scroll, key press)
- Loops until the goal is complete, fails, or reaches the step limit
- Falls back from the local model to Claude automatically in `--vision auto` mode
- Supports dry-run mode to preview actions without executing them
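The loop behind these features can be sketched in a few lines. This is an illustrative skeleton, not the actual implementation; the screenshot, vision, and action functions are passed in so the structure stands on its own:

```python
import time

def run_loop(goal, screenshot_fn, vision_fn, act_fn, max_steps=20, delay=1.0):
    """Minimal sketch of the vision-action loop (names are illustrative)."""
    history = []
    for _ in range(max_steps):
        image = screenshot_fn()
        # The model sees the goal, the current screen, and all prior actions.
        action = vision_fn(goal, image, history)
        if action["action"] == "done":
            return True
        if action["action"] == "fail":
            return False
        act_fn(action)          # e.g. translate to an xdotool command
        history.append(action)
        time.sleep(delay)       # give the UI time to settle
    return False                # step limit reached without completion
```

Passing the history on every call is what lets the model avoid retrying an action that already failed.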
- Linux with an X session
- Python 3.10+
- `xdotool` — for mouse/keyboard control
- One of:
  - `imagemagick` (for screenshots via `import`)
  - `scrot` (fallback)
- One of:
  - `ANTHROPIC_API_KEY` set (for cloud vision via Claude)
  - Local Ollama with a vision model (e.g. `gemma4:e2b`) for local vision
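The `imagemagick`-then-`scrot` fallback amounts to checking which tool is on `PATH` and building the matching command. A sketch (the `which` parameter exists only to make the choice testable; the real script may structure this differently):

```python
import shutil

def screenshot_cmd(path, which=shutil.which):
    """Pick a screenshot command: imagemagick's `import` first, scrot as fallback."""
    if which("import"):
        # `import -window root` captures the entire screen, not one window
        return ["import", "-window", "root", path]
    if which("scrot"):
        return ["scrot", path]
    raise RuntimeError("install imagemagick or scrot for screenshots")
```

The returned argument list can be handed to `subprocess.run(..., check=True)` with `DISPLAY` set in the environment.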
```sh
sudo apt install xdotool imagemagick
pip install anthropic
```

```sh
# click a button
python3 desktop.py "click the Submit button"

# fill in a form field
python3 desktop.py "fill in the username field with 'alice'"

# just describe what's on screen
python3 desktop.py --screenshot-only

# use only cloud (Claude) vision
python3 desktop.py --vision cloud "close the dialog box"

# use only local model (Ollama)
python3 desktop.py --vision local "click OK"

# dry run — see what it would do without doing it
python3 desktop.py --dry-run "open the terminal"

# save screenshots from each step
python3 desktop.py --save-screenshots "log in"
```

```
--max-steps N        Max action steps before giving up (default: 20)
--display :0         X display to use (default: $DISPLAY or :0)
--screenshot-only    Describe the screen and exit
--dry-run            Print actions without executing
--save-screenshots   Save each step screenshot to /tmp/desktop-step-N.png
--delay SECS         Wait this long after each action (default: 1.0)
--vision MODE        Vision backend: auto, local, cloud (default: auto)
```
- `auto` (default): tries local Ollama first, falls back to Claude if it fails or is unavailable.
- `local`: uses Ollama with a vision model. Expects `gemma4:e2b` by default, or whatever model is served on port 11434. Fully offline.
- `cloud`: uses Claude via the Anthropic API (`ANTHROPIC_API_KEY`). More capable, costs tokens.
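The `auto` fallback can be as simple as a try/except around the local call. A sketch with the backend functions injected (the real script's function names will differ):

```python
def ask_vision(goal, image, history, local_fn, cloud_fn, mode="auto"):
    """Route a vision query by backend mode; `auto` tries local, then cloud."""
    if mode == "local":
        return local_fn(goal, image, history)
    if mode == "cloud":
        return cloud_fn(goal, image, history)
    # auto: prefer the free, offline backend
    try:
        return local_fn(goal, image, history)
    except Exception:
        # Ollama unreachable or returned garbage: fall back to Claude
        return cloud_fn(goal, image, history)
```

A broad `except` is deliberate here: connection errors, timeouts, and unparseable model output should all trigger the same fallback.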
Each step:
- Takes a screenshot
- Sends it to the vision model with the goal and history of prior steps
- The model returns a JSON action (`click`, `type`, `key`, `scroll`, `move`, `wait`, `done`, `fail`)
- `xdotool` executes the action
- Repeat
The model sees the full action history each step, so it knows what it already tried.
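Turning the model's JSON action into an xdotool invocation is essentially a dispatch table. A sketch, assuming field names like `x`, `y`, `text` in the action schema (the real schema may differ); `done` and `fail` terminate the loop before dispatch, and `wait` needs no command:

```python
def to_xdotool(action):
    """Map a model-returned action dict to an xdotool argument list (sketch)."""
    kind = action["action"]
    if kind == "click":
        # chain: move the pointer, then press button 1 (left click)
        return ["xdotool", "mousemove", str(action["x"]), str(action["y"]),
                "click", "1"]
    if kind == "type":
        return ["xdotool", "type", "--delay", "50", action["text"]]
    if kind == "key":
        return ["xdotool", "key", action["key"]]        # e.g. "Return", "ctrl+w"
    if kind == "scroll":
        # X convention: button 4 scrolls up, button 5 scrolls down
        button = "4" if action.get("direction") == "up" else "5"
        return ["xdotool", "click", "--repeat", str(action.get("amount", 3)), button]
    if kind == "move":
        return ["xdotool", "mousemove", str(action["x"]), str(action["y"])]
    if kind == "wait":
        return None     # handled by sleeping in the loop, no command needed
    raise ValueError(f"unknown action: {kind}")
```

xdotool accepts chained subcommands in one invocation, which is why `click` can move and press in a single call.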
- Requires a running X session (no Wayland support yet)
- Coordinate accuracy depends on the vision model — local models are less precise than Claude
- Not designed for speed; there's a configurable delay between steps
- No automatic captcha solving or anti-bot bypass
Licensed under the MIT License.