Screenshot → vision model → xdotool action → repeat.
A minimal vision-action loop for GUI automation on Linux. Give it a goal in plain English; it takes a screenshot, asks a vision model what to click or type, executes the action, and repeats until the goal is met or it gives up.
- Takes a screenshot of your X display
- Sends it to a vision model (local Ollama or Claude via API)
- Executes the returned action via xdotool (click, type, scroll, key press)
- Loops until the goal is complete, fails, or reaches the step limit
- Falls back from the local model to Claude automatically in `--vision auto` mode
- Supports dry-run mode to preview actions without executing them
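The loop behind these features can be sketched in a few lines. This is an illustrative skeleton, not the actual implementation; the screenshot, vision, and action functions are passed in so the structure stands on its own:

```python
import time

def run_loop(goal, screenshot_fn, vision_fn, act_fn, max_steps=20, delay=1.0):
    """Minimal sketch of the vision-action loop (names are illustrative)."""
    history = []
    for _ in range(max_steps):
        image = screenshot_fn()
        # The model sees the goal, the current screen, and all prior actions.
        action = vision_fn(goal, image, history)
        if action["action"] == "done":
            return True
        if action["action"] == "fail":
            return False
        act_fn(action)          # e.g. translate to an xdotool command
        history.append(action)
        time.sleep(delay)       # give the UI time to settle
    return False                # step limit reached without completion
```

Passing the history on every call is what lets the model avoid retrying an action that already failed.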
- Linux with an X session
- Python 3.10+
- `xdotool` — for mouse/keyboard control
- One of:
  - `imagemagick` (for screenshots via `import`)
  - `scrot` (fallback)
- One of:
  - `ANTHROPIC_API_KEY` set (for cloud vision via Claude)
  - Local Ollama with a vision model (e.g. `gemma4:e2b`) for local vision
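The `imagemagick`-then-`scrot` fallback amounts to checking which tool is on `PATH` and building the matching command. A sketch (the `which` parameter exists only to make the choice testable; the real script may structure this differently):

```python
import shutil

def screenshot_cmd(path, which=shutil.which):
    """Pick a screenshot command: imagemagick's `import` first, scrot as fallback."""
    if which("import"):
        # `import -window root` captures the entire screen, not one window
        return ["import", "-window", "root", path]
    if which("scrot"):
        return ["scrot", path]
    raise RuntimeError("install imagemagick or scrot for screenshots")
```

The returned argument list can be handed to `subprocess.run(..., check=True)` with `DISPLAY` set in the environment.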
```sh
sudo apt install xdotool imagemagick
pip install anthropic
```

```sh
# click a button
python3 desktop.py "click the Submit button"

# fill in a form field
python3 desktop.py "fill in the username field with 'alice'"

# just describe what's on screen
python3 desktop.py --screenshot-only

# use only cloud (Claude) vision
python3 desktop.py --vision cloud "close the dialog box"

# use only local model (Ollama)
python3 desktop.py --vision local "click OK"

# dry run — see what it would do without doing it
python3 desktop.py --dry-run "open the terminal"

# save screenshots from each step
python3 desktop.py --save-screenshots "log in"
```

```
--max-steps N        Max action steps before giving up (default: 20)
--display :0         X display to use (default: $DISPLAY or :0)
--screenshot-only    Describe the screen and exit
--dry-run            Print actions without executing
--save-screenshots   Save each step screenshot to /tmp/desktop-step-N.png
--delay SECS         Wait this long after each action (default: 1.0)
--vision MODE        Vision backend: auto, local, cloud (default: auto)
```
- `auto` (default): tries local Ollama first, falls back to Claude if it fails or is unavailable.
- `local`: uses Ollama with a vision model. Expects `gemma4:e2b` by default, or whatever model is served on port 11434. Fully offline.
- `cloud`: uses Claude via the Anthropic API (`ANTHROPIC_API_KEY`). More capable, costs tokens.
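The `auto` fallback can be as simple as a try/except around the local call. A sketch with the backend functions injected (the real script's function names will differ):

```python
def ask_vision(goal, image, history, local_fn, cloud_fn, mode="auto"):
    """Route a vision query by backend mode; `auto` tries local, then cloud."""
    if mode == "local":
        return local_fn(goal, image, history)
    if mode == "cloud":
        return cloud_fn(goal, image, history)
    # auto: prefer the free, offline backend
    try:
        return local_fn(goal, image, history)
    except Exception:
        # Ollama unreachable or returned garbage: fall back to Claude
        return cloud_fn(goal, image, history)
```

A broad `except` is deliberate here: connection errors, timeouts, and unparseable model output should all trigger the same fallback.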
Each step:
- Takes a screenshot
- Sends it to the vision model with the goal and history of prior steps
- The model returns a JSON action (`click`, `type`, `key`, `scroll`, `move`, `wait`, `done`, `fail`)
- `xdotool` executes the action
- Repeat
The model sees the full action history each step, so it knows what it already tried.
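Turning the model's JSON action into an xdotool invocation is essentially a dispatch table. A sketch, assuming field names like `x`, `y`, `text` in the action schema (the real schema may differ); `done` and `fail` terminate the loop before dispatch, and `wait` needs no command:

```python
def to_xdotool(action):
    """Map a model-returned action dict to an xdotool argument list (sketch)."""
    kind = action["action"]
    if kind == "click":
        # chain: move the pointer, then press button 1 (left click)
        return ["xdotool", "mousemove", str(action["x"]), str(action["y"]),
                "click", "1"]
    if kind == "type":
        return ["xdotool", "type", "--delay", "50", action["text"]]
    if kind == "key":
        return ["xdotool", "key", action["key"]]        # e.g. "Return", "ctrl+w"
    if kind == "scroll":
        # X convention: button 4 scrolls up, button 5 scrolls down
        button = "4" if action.get("direction") == "up" else "5"
        return ["xdotool", "click", "--repeat", str(action.get("amount", 3)), button]
    if kind == "move":
        return ["xdotool", "mousemove", str(action["x"]), str(action["y"])]
    if kind == "wait":
        return None     # handled by sleeping in the loop, no command needed
    raise ValueError(f"unknown action: {kind}")
```

xdotool accepts chained subcommands in one invocation, which is why `click` can move and press in a single call.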
- Requires a running X session (no Wayland support yet)
- Coordinate accuracy depends on the vision model — local models are less precise than Claude
- Not designed for speed; there's a configurable delay between steps
- No automatic captcha solving or anti-bot bypass
Licensed under the MIT License.