kithfoss/desktop

desktop

Screenshot → vision model → xdotool action → repeat.

A minimal vision-action loop for GUI automation on Linux. Give it a goal in plain English: it takes a screenshot, asks a vision model what to click or type, performs that action, and repeats until the goal is met or it gives up.


What it does

  • Takes a screenshot of your X display
  • Sends it to a vision model (local Ollama or Claude via API)
  • Executes the returned action via xdotool (click, type, scroll, key press)
  • Loops until the goal is complete, fails, or reaches the step limit
  • Falls back from local model to Claude automatically in --vision auto mode
  • Supports dry-run mode to preview actions without executing them
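
The screenshot step can be sketched as a small helper that picks a capture command. This is an illustration only, assuming ImageMagick's `import` is preferred and `scrot` is the fallback as listed under Requirements; the function name `screenshot_command` is hypothetical and not taken from desktop.py.

```python
import shutil


def screenshot_command(path, which=shutil.which):
    """Return an argv list that captures the whole X display to `path`.

    `which` is injectable so the choice can be tested without the tools
    being installed. Hypothetical helper, not the project's actual code.
    """
    if which("import"):
        # ImageMagick: capture the root window, i.e. the full screen.
        return ["import", "-window", "root", path]
    if which("scrot"):
        # scrot takes the output file as its positional argument.
        return ["scrot", path]
    raise RuntimeError("no screenshot tool found (install imagemagick or scrot)")
```

To target a specific display, the command could be run with `subprocess.run(cmd, env={**os.environ, "DISPLAY": ":0"}, check=True)`, since both tools read the `DISPLAY` environment variable.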

Requirements

  • Linux with an X session
  • Python 3.10+
  • xdotool — for mouse/keyboard control
  • One of:
    • imagemagick (for screenshots via import)
    • scrot (fallback)
  • One of:
    • ANTHROPIC_API_KEY set (for cloud vision via Claude)
    • Local Ollama with a vision model (e.g. gemma4:e2b) for local vision

# system packages (Debian/Ubuntu)
sudo apt install xdotool imagemagick

# Anthropic Python client, used for cloud vision
pip install anthropic
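
A quick preflight check for the requirements above might look like the following sketch. It assumes that finding an `ollama` binary on the PATH is a reasonable proxy for local vision being available, which will not hold if Ollama runs elsewhere. The function name `check_requirements` is hypothetical.

```python
import os
import shutil


def check_requirements(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if which("xdotool") is None:
        problems.append("xdotool not found (needed for mouse/keyboard control)")
    # Either screenshot tool is enough: ImageMagick's `import` or scrot.
    if which("import") is None and which("scrot") is None:
        problems.append("no screenshot tool found (install imagemagick or scrot)")
    # Either vision backend is enough: an API key for Claude, or Ollama.
    # Assumption: an `ollama` binary on PATH implies local vision is usable.
    if not env.get("ANTHROPIC_API_KEY") and which("ollama") is None:
        problems.append("no vision backend: set ANTHROPIC_API_KEY or install Ollama")
    return problems
```

Calling `check_requirements()` with no arguments inspects the real environment; the `env` and `which` parameters exist only so the check can be exercised in isolation.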

Usage

# click a button
python3 desktop.py "click the Submit button"

# fill in a form field
python3 desktop.py "fill in the username field with 'alice'"

# just describe what's on screen
python3 desktop.py --screenshot-only

# use only cloud (Claude) vision
python3 desktop.py --vision cloud "close the dialog box"

# use only local model (Ollama)
python3 desktop.py --vision local "click OK"

# dry run — see what it would do without doing it
python3 desktop.py --dry-run "open the terminal"

# save screenshots from each step
python3 desktop.py --save-screenshots "log in"

Options

--max-steps N        Max action steps before giving up (default: 20)
--display :0         X display to use (default: $DISPLAY or :0)
--screenshot-only    Describe the screen and exit
--dry-run            Print actions without executing
--save-screenshots   Save each step screenshot to /tmp/desktop-step-N.png
--delay SECS         Wait this long after each action (default: 1.0)
--vision MODE        Vision backend: auto, local, cloud (default: auto)
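
For readers who want to see how these options fit together, here is a sketch of an `argparse` parser that accepts the same flags and defaults. It is reconstructed from the table above, not copied from desktop.py, so the real parser may differ in details such as help text or validation.

```python
import argparse
import os


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="desktop.py")
    # The goal is optional because --screenshot-only needs no goal.
    parser.add_argument("goal", nargs="?", help="goal in plain English")
    parser.add_argument("--max-steps", type=int, default=20)
    # Default: $DISPLAY if set and non-empty, otherwise :0.
    parser.add_argument("--display", default=os.environ.get("DISPLAY") or ":0")
    parser.add_argument("--screenshot-only", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--save-screenshots", action="store_true")
    parser.add_argument("--delay", type=float, default=1.0)
    parser.add_argument("--vision", choices=["auto", "local", "cloud"], default="auto")
    return parser
```

With this parser, `build_parser().parse_args(["--dry-run", "open the terminal"])` yields `dry_run=True` and `goal="open the terminal"`, with every other option at its documented default.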

Vision backends

auto (default): tries local Ollama first, falls back to Claude if it fails or is unavailable.

local: uses a vision model served by Ollama on its default port, 11434. Expects gemma4:e2b by default, or whichever vision model that Ollama instance provides. Runs fully offline.

cloud: uses Claude via the Anthropic API (requires ANTHROPIC_API_KEY). More capable, but every request consumes API tokens.
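
The three modes can be sketched as a small dispatcher. This is an assumption about the control flow, not the project's code: `ask_local` and `ask_cloud` are hypothetical zero-argument callables standing in for the Ollama and Claude requests, and any exception from the local call is treated as "failed or unavailable".

```python
def choose_and_ask(mode, ask_local, ask_cloud):
    """Query the backend(s) selected by `mode`; return (backend_name, reply)."""
    if mode not in ("auto", "local", "cloud"):
        raise ValueError(f"unknown vision mode: {mode!r}")
    if mode in ("auto", "local"):
        try:
            return "local", ask_local()
        except Exception:
            if mode == "local":
                raise  # local-only mode: surface the error, no fallback
            # auto mode: fall through to the cloud backend below
    return "cloud", ask_cloud()
```

Returning the backend name alongside the reply makes it easy to log which model produced each action, which helps when judging coordinate accuracy.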


How it works

Each step:

  1. Takes a screenshot
  2. Sends it to the vision model with the goal and history of prior steps
  3. The model returns a JSON action (click, type, key, scroll, move, wait, done, fail)
  4. xdotool executes the action
  5. Repeats from step 1 until the model returns done or fail, or the step limit is reached
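
Steps 3 and 4 could be sketched as a translation from a JSON action to xdotool arguments. The README does not specify the JSON schema, so the field names used here (`action`, `x`, `y`, `text`, `key`, `direction`) are hypothetical, as is the function name. The xdotool subcommands themselves (`mousemove`, `click`, `type`, `key`) are standard.

```python
import json


def action_to_commands(raw):
    """Translate one JSON action string into a list of xdotool argv lists.

    Field names are assumed for illustration; the real schema may differ.
    """
    action = json.loads(raw)
    kind = action["action"]
    if kind == "click":
        # Move the pointer, then press the left button (button 1).
        return [["xdotool", "mousemove", str(action["x"]), str(action["y"]), "click", "1"]]
    if kind == "move":
        return [["xdotool", "mousemove", str(action["x"]), str(action["y"])]]
    if kind == "type":
        return [["xdotool", "type", action["text"]]]
    if kind == "key":
        return [["xdotool", "key", action["key"]]]
    if kind == "scroll":
        # X11 reports the scroll wheel as buttons 4 (up) and 5 (down).
        button = "4" if action.get("direction") == "up" else "5"
        return [["xdotool", "click", button]]
    if kind in ("wait", "done", "fail"):
        return []  # nothing for xdotool to do
    raise ValueError(f"unknown action: {kind!r}")
```

Returning argument lists instead of running them directly keeps a dry run trivial: print the lists in dry-run mode, or pass each one to `subprocess.run(cmd, check=True)` otherwise.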

The model sees the full action history each step, so it knows what it already tried.
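
The loop and its growing history can be sketched as below. The callables `take_screenshot`, `ask_model`, and `execute` are hypothetical stand-ins for the real screenshot, vision, and xdotool code, and the `"action"` key is an assumed field name. The post-action delay is left out for brevity.

```python
def run_loop(goal, take_screenshot, ask_model, execute, max_steps=20):
    """Run the screenshot -> model -> action cycle; return (status, history)."""
    history = []
    for _ in range(max_steps):
        image = take_screenshot()
        # Pass a copy so the model sees every prior step but cannot mutate it.
        action = ask_model(goal, image, list(history))
        history.append(action)
        if action["action"] in ("done", "fail"):
            return action["action"], history
        execute(action)
    # The model never reported done or fail within the step limit.
    return "step_limit", history
```

Because the dependencies are injected, the loop can be exercised with fake callables before pointing it at a real display.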


Limitations

  • Requires a running X session (no Wayland support yet)
  • Coordinate accuracy depends on the vision model — local models are less precise than Claude
  • Not designed for speed; it waits a configurable delay after each action (--delay, default 1.0 s)
  • No automatic captcha solving or anti-bot bypass

License

MIT
