gui-agent
ALL interactions with ANY app — whether built-in (Finder, Safari, System Settings) or third-party (WeChat, Chrome, Slack) — MUST go through this skill. Clicking, typing, reading content, sending messages, navigating menus, filling forms: everything uses visual detection (screenshot → template match → click). This is the ONLY way to operate apps. Never bypass with CLI commands, AppleScript, or Accessibility APIs.
Install Skill
Skills are third-party code from public GitHub repositories. SkillHub scans for known malicious patterns but cannot guarantee safety. Review the source code before installing.
Install globally (user-level):
```bash
npx skillhub install Fzkuji/GUI-Agent-Skills/gui-agent
```
Install in current project:
```bash
npx skillhub install Fzkuji/GUI-Agent-Skills/gui-agent --project
```
Suggested path: `~/.claude/skills/gui-agent/`
AI Review
Sophisticated GUI automation skill backed by benchmark evidence and comprehensive architecture. The split visual memory model (permanent knowledge vs. machine-specific captures) and platform-specific action definitions show engineering maturity.
SKILL.md Content
---
name: gui-agent
description: "ALL interactions with ANY app — whether built-in (Finder, Safari, System Settings) or third-party (WeChat, Chrome, Slack) — MUST go through this skill. Clicking, typing, reading content, sending messages, navigating menus, filling forms: everything uses visual detection (screenshot → template match → click). This is the ONLY way to operate apps. Never bypass with CLI commands, AppleScript, or Accessibility APIs."
---
# GUI Agent Skill
## 🔴 VISION vs COMMAND — When to Use What (READ FIRST)
Every GUI task involves two kinds of operations. **Know the boundary.**
### MUST be vision-based (screenshot → detect → act)
- **Determining current state** — "What page am I on? What's visible?"
- **Locating click targets** — buttons, links, menu items, icons → coordinates MUST come from GPA-GUI-Detector / OCR / template matching
- **Verifying results** — "Did my action work? Did the page change?"
- **Handling unexpected UI** — popups, cookie banners, error pages, CAPTCHA
- **Reading content** — extracting text/data from the screen
- **Any spatial decision** — "where on screen is X?"
### MAY use keyboard shortcuts / CLI commands (non-visual)
- **Keyboard shortcuts** — Ctrl+L (address bar), Ctrl+T (new tab), Ctrl+W (close tab), Ctrl+C/V (copy/paste), Page Down (scroll), etc.
- **Text input** — typing URLs, search queries, form values (pyautogui.typewrite / hotkey)
- **System commands** — launching apps, setting resolution (xrandr), checking processes
### ⚠️ THE RULE: Decision = Visual, Execution = Best Tool
```
✅ CORRECT workflow:
1. Screenshot → detect/OCR → understand current state (VISUAL)
2. Decide what to do next based on what you SEE (VISUAL)
3. Execute: click detected coordinates OR use keyboard shortcut (BEST TOOL)
4. Screenshot → verify the result (VISUAL)
❌ WRONG workflows:
- Skip observation, go straight to keyboard commands (no visual basis)
- Know the answer beforehand, type it without looking (not agent behavior)
- Use CLI to navigate instead of interacting with the UI
- Chain multiple actions without visual verification between them
```
### Examples
```
✅ "I see Chrome is open on United Airlines homepage" → screenshot confirms this
→ "I see 'Travel info' in the nav bar at (661, 188) from OCR" → click (661, 188)
→ screenshot → "dropdown opened, I see 'Baggage' link at (650, 250)" → click
❌ "I know the URL is united.com/en/us/checked-bag-fee-calculator"
→ Ctrl+L → type URL → Enter → done
(No visual observation drove the decision — this is command-line with extra steps)
✅ "I see I'm in Chrome" (visual) → Ctrl+L to focus address bar (shortcut is fine)
→ "I need to search for baggage calculator" → type search query (input is fine)
→ screenshot → verify results (visual)
(Visual observation → shortcut for efficiency → visual verification)
```
**Bottom line: You must LOOK before you ACT. Every action must be justified by what you observed on screen. Shortcuts are tools for execution, not substitutes for observation.**
### 🔍 Three Visual Methods — When to Use Each
You have three ways to "see" the screen. They serve different purposes. **Do not mix up their roles.**
#### Method 1: OCR (`detect_text`)
- **What it does**: Uses Apple Vision framework to read all text on screen
- **Returns**: Each text element with: `label` (the text), `cx`/`cy` (center coordinates), `x`/`y`/`w`/`h` (bounding box)
- **Use when**: Finding a specific text label, link, menu item, button with text, or any UI element that has readable text
- **Strengths**: Precise text content + exact coordinates; most UI elements have text labels so this works for the majority of cases
- **Limitations**: Cannot detect non-text elements (icons without labels, graphical buttons, images)
- **✅ Provides click coordinates**: YES — use `cx`, `cy` from the result to click
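A minimal sketch of this method, assuming `detect_text` is exposed by `scripts/ui_detector.py` and accepts whatever `screenshot()` returns; both names appear in this repo, but the exact signatures are assumptions:
```python
from ui_detector import detect_text              # assumed module path (scripts/ui_detector.py)
from platform_input import screenshot, click_at

img = screenshot()                               # capture the current screen (return type assumed)
elements = detect_text(img)                      # Apple Vision OCR pass
# pick the element whose OCR label matches the target text
target = next(e for e in elements if e["label"] == "Travel info")
click_at(int(target["cx"]), int(target["cy"]))   # click the text's measured center
```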
#### Method 2: GPA-GUI-Detector (`detect_icons`)
- **What it does**: Runs a YOLO-based UI element detection model (Salesforce/GPA-GUI-Detector)
- **Returns**: Each detected UI component with: `cx`/`cy` (center coordinates), `x`/`y`/`w`/`h` (bounding box), `confidence` score. Label is always `null` (it detects position/shape only, not semantics)
- **Use when**: Finding buttons, icons, checkboxes, input fields, or other UI components that are identifiable by their shape/position rather than text
- **Strengths**: Finds all interactive elements regardless of whether they have text; good for icon-only buttons (hamburger menu, close button, three-dot menu, etc.)
- **Limitations**: No semantic labels — you get bounding boxes but don't know WHAT each box is. Must combine with OCR or image tool to identify which box is which
- **✅ Provides click coordinates**: YES — use `cx`, `cy` from the result to click
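Since detector boxes come back unlabeled, a common trick is to borrow the nearest OCR label. A hypothetical helper, assuming the same `ui_detector` module and the output fields documented above:
```python
from ui_detector import detect_text, detect_icons   # assumed module path

def label_boxes(img_path, max_dist=40):
    """Attach the nearest OCR label to each anonymous detector box."""
    texts = detect_text(img_path)
    boxes = detect_icons(img_path)                   # label is always null
    for box in boxes:
        # nearest OCR element by squared center distance
        nearest = min(
            texts,
            key=lambda t: (t["cx"] - box["cx"]) ** 2 + (t["cy"] - box["cy"]) ** 2,
            default=None,
        )
        if nearest and abs(nearest["cx"] - box["cx"]) <= max_dist \
                and abs(nearest["cy"] - box["cy"]) <= max_dist:
            box["label"] = nearest["label"]          # borrow the OCR text
    return boxes
```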
#### Method 3: image tool (LLM vision)
- **What it does**: Sends a screenshot to the LLM for visual understanding
- **Returns**: Natural language description of what's on screen — page layout, element meanings, spatial relationships, current state
- **Use when**: You need to UNDERSTAND the screen — "What page is this?", "What does this dialog mean?", "Which of the detected elements is the one I need?", "What should my next step be?"
- **Strengths**: Semantic understanding, can interpret complex layouts, read visual context that OCR/detector miss
- **Limitations**: ⛔ **NEVER provides reliable coordinates**. The LLM may describe positions ("top right corner", "third button") but these are ESTIMATES, not measured coordinates. NEVER use positions from image tool output for clicking.
- **⛔ Does NOT provide click coordinates**: NO — NEVER extract coordinates from image tool responses. ALWAYS go back to OCR/detector results for the actual click position.
#### Workflow: Unfamiliar → Familiar (progressive)
**Phase 1: First encounter / unfamiliar page (DEFAULT)**
Use all three methods together. This is the starting point for any new page or uncertain situation.
```
Step 1: Take screenshot
Step 2: Run OCR (detect_text) on the screenshot
→ get all text elements with their coordinates
→ read the output: you now know what text is on screen and where
Step 3: Send the screenshot to image tool
→ LLM sees the page visually
→ understand: what page is this? what's the layout? what elements matter?
→ ⛔ DO NOT use any coordinates from the image tool response
Step 4: Run GPA-GUI-Detector (detect_icons) on the screenshot
→ get all UI component bounding boxes with coordinates
Step 5: LLM decides what to click
→ combine: OCR text labels + visual understanding + detector positions
→ identify the target element
→ get its coordinates from OCR or detector results (NEVER from image tool)
→ execute the click at those coordinates
```
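A condensed sketch of one Phase-1 pass in Python (module paths assumed; the image-tool step is LLM-side and appears only as a comment):
```python
from ui_detector import detect_text, detect_icons    # assumed module path
from platform_input import screenshot, click_at

img = screenshot()                    # Step 1: capture
texts = detect_text(img)              # Step 2: OCR → labels + coordinates
# Step 3 happens LLM-side: send img to the image tool for understanding,
# then throw away any positions the LLM mentions.
boxes = detect_icons(img)             # Step 4: detector → anonymous boxes
target = next(t for t in texts if t["label"] == "Baggage")   # Step 5: decide
click_at(int(target["cx"]), int(target["cy"]))               # click from OCR coords only
```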
**Phase 2: Familiar page / repeated workflow (OPTIMIZATION)**
Once you've seen a page before and know what to expect, skip the image tool to save tokens.
```
Step 1: Take screenshot (but don't send to image tool)
Step 2: Run OCR + GPA-GUI-Detector on the screenshot
→ get text + coordinates as structured text data
Step 3: LLM reads the text output directly (no visual analysis needed)
→ identify the target element from text labels and positions
→ click using OCR/detector coordinates
```
**When to transition from Phase 1 to Phase 2:**
- You've successfully operated on this page/state before
- The OCR + detector text output gives you enough information to decide without seeing the screenshot
- You're confident about what elements to expect on this page
**When to fall back to Phase 1:**
- Something unexpected happened (wrong page, new popup, error)
- OCR + detector output doesn't make sense or seems incomplete
- You're unsure about the current state
- Whenever in doubt — Phase 1 is always safe
#### Summary of rules
- **OCR → coordinates ✅** — use for clicking text elements
- **GPA-GUI-Detector → coordinates ✅** — use for clicking non-text UI elements
- **image tool → understanding only ⛔ NO coordinates** — use for deciding WHAT to click, then get the WHERE from OCR/detector
- **Phase 1 is the safe default** — always start here, optimize to Phase 2 only when confident
- **Remote VMs (OSWorld)** — download screenshot to Mac, run OCR and/or detector locally, send coordinates back to VM. Same three methods, same rules, same phases.
---
You ARE the agent loop. Every GUI task follows this flow:
```
OBSERVE → ENSURE APP READY → ACT+SAVE (detect→match→save components→execute→diff→save transition) → REPORT
```
## Sub-Skills
Each step in the execution flow below has a corresponding sub-skill file. **When you reach that step, you MUST `read` the sub-skill file first.** This is not optional — the sub-skill contains the exact procedure and rules for that step.
| Step | Sub-Skill | Read when |
|------|-----------|-----------|
| **Observe** | `read {baseDir}/skills/gui-observe/SKILL.md` | MUST read before taking any screenshot or detecting state |
| **Learn** | `read {baseDir}/skills/gui-learn/SKILL.md` | MUST read before learning a new app or re-learning components |
| **Act + Memory** | `read {baseDir}/skills/gui-act/SKILL.md` | MUST read before any action. Includes detection, matching, execution, AND memory saving as one unified flow |
| **Memory (reference)** | `read {baseDir}/skills/gui-memory/SKILL.md` | Reference for memory structure (split storage: meta/components/states/transitions, forgetting, browser sites/) |
| **Workflow** | `read {baseDir}/skills/gui-workflow/SKILL.md` | MUST read before multi-step navigation or state graph operations |
| **Setup** | `read {baseDir}/skills/gui-setup/SKILL.md` | MUST read before first-time setup on a new machine |
| **Report** | `read {baseDir}/skills/gui-report/SKILL.md` | MUST read before tracking or reporting task performance |
## Core Commands
**exec timeout**: Always use `timeout=60` for GUI commands. Commands return as soon as they finish; the timeout only caps the maximum wait.
```bash
source ~/gui-agent-env/bin/activate
cd ~/.openclaw/workspace/skills/gui-agent
# Observe
python3 scripts/agent.py learn --app AppName # Detect + save components
python3 scripts/agent.py detect --app AppName # Match known components
python3 scripts/agent.py list --app AppName # List saved components
# Act
python3 scripts/agent.py click --app AppName --component ButtonName
python3 scripts/agent.py open --app AppName
python3 scripts/agent.py cleanup --app AppName
# State graph
python3 scripts/app_memory.py transitions --app AppName # View state graph
python3 scripts/app_memory.py path --app AppName --from from_state --to to_state # Find route
# Messaging (prints guidance, agent executes step by step)
python3 scripts/agent.py send_message --app WeChat --contact "小明" --message "明天见"
```
## Execution Flow
### STEP 0: OBSERVE
→ **MUST `read {baseDir}/skills/gui-observe/SKILL.md` first**
Take screenshot. Run GPA-GUI-Detector + OCR to detect all UI elements. Use `image` tool only to **understand** the scene (not for coordinates).
### STEP 1: ENSURE APP READY
→ **MUST `read {baseDir}/skills/gui-learn/SKILL.md` first** (if learning needed)
If app not in memory → learn. If component not found → re-learn current state.
### STEP 2: ACT + SAVE (one unified step, per-click)
→ **MUST `read {baseDir}/skills/gui-act/SKILL.md` first**
gui-act defines the 7-sub-step flow for EACH click:
1. **DETECT** — screenshot → OCR + GPA-GUI-Detector
2. **MATCH** — compare against saved memory
3. **SAVE COMPONENTS** — new elements → `learn_from_screenshot()` (BEFORE clicking!)
4. **DECIDE & EXECUTE** — pick target → click at detected coordinates
5. **DETECT AGAIN** — screenshot after click (if needed to verify)
6. **DIFF** — compare before vs after
7. **SAVE TRANSITION** — `record_page_transition()` records state change
**Component saving happens BEFORE the click (step 3), not after.** This ensures memory is always populated even if the click fails.
Both save functions are automated — no manual cropping or JSON editing:
- `learn_from_screenshot(img_path, domain, app_name, page_name)` — auto-detects, crops, deduplicates, saves all components
- `record_page_transition(before_img, after_img, click_label, click_pos, domain)` — auto-diffs OCR, saves states + transition
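A hypothetical sketch of one full ACT+SAVE cycle wiring these two functions together; the module paths, the `"apps"` domain value, and `screenshot()` returning a path are all assumptions:
```python
from app_memory import learn_from_screenshot, record_page_transition  # assumed module path
from ui_detector import detect_text
from platform_input import screenshot, click_at

before = screenshot()                                      # 1-2. DETECT + MATCH input
learn_from_screenshot(before, "apps", "WeChat", "main")    # 3. SAVE COMPONENTS (before clicking; domain is a placeholder)
target = next(t for t in detect_text(before) if t["label"] == "File Transfer")
click_at(int(target["cx"]), int(target["cy"]))             # 4. DECIDE & EXECUTE
after = screenshot()                                       # 5. DETECT AGAIN
record_page_transition(before, after, "File Transfer",     # 6-7. DIFF + SAVE TRANSITION
                       (target["cx"], target["cy"]), "apps")
```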
For memory structure details (split storage format, forgetting mechanism, browser sites/): `read {baseDir}/skills/gui-memory/SKILL.md`
### STEP 3: REPORT
Report is mostly automatic (detect_all auto-starts tracker, functions auto-tick counters).
At the END of a GUI task, run this one command to generate and save the report:
```bash
source ~/gui-agent-env/bin/activate
python3 ~/.openclaw/workspace/skills/gui-agent/skills/gui-report/scripts/tracker.py report
```
This prints a one-line summary + saves full data to `logs/task_history.jsonl`.
If you forget, the data is auto-saved the next time the tracker starts.
---
## ⛔ ABSOLUTE RULES (read every time, no exceptions)
**WHERE DO CLICK COORDINATES COME FROM?**
```
✅ ALLOWED coordinate sources:
1. GPA-GUI-Detector (detect_icons) → bounding box center
2. Apple Vision OCR (detect_text) → text bounding box center
3. Template matching → saved component position
❌ FORBIDDEN:
- LLM/vision model guessing coordinates ("it looks like it's around 500, 300")
- Hardcoded pixel positions from memory or documentation
- Coordinates from image tool analysis (image tool = understanding ONLY)
```
**Every click**: screenshot → run GPA-GUI-Detector and/or OCR → get coordinates from detection result → click that coordinate. No exceptions. If detection can't find the element, re-detect or re-learn — do NOT guess.
**This applies everywhere**: local Mac apps, remote VMs (OSWorld), any platform. For remote VMs: download screenshot to Mac → run detection locally → send click coordinates back to VM.
## Key Principles
1. **Vision-driven, no shortcuts** — screenshot → detect → match → click. Only allowed system calls: `activate` (bring to front), `screencapture`, `platform_input.py` (pynput click/type).
2. **Coordinates from detection only** — see ABSOLUTE RULES above. The `image` tool is for understanding ("what is this?", "which button should I click?"), NEVER for getting pixel coordinates.
3. **Not found = not on screen** — don't lower thresholds. Re-learn current state to discover what IS on screen.
4. **State graph drives navigation** — each click records a transition. Use `find_path()` to route between states.
5. **First time: screenshot + image. Repeat: detection only** — saves tokens on known workflows.
6. **Paste > Type** for CJK text
7. **Integer logical coordinates** — pynput uses screen logical pixels
8. **ALWAYS save to memory** — every GUI operation MUST save detection results, learned components, and state information to `memory/apps/<appname>/`. This is the core of the system. Even for one-off tasks or benchmarks (e.g., OSWorld), save what you learn about each app. Memory is local (gitignored) but essential — it's what makes GUI Agent Skills learn and improve.
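Principles 6 and 7 in a few lines, as a minimal sketch using the `platform_input` helpers listed under Input Methods below:
```python
from platform_input import click_at, paste_text

cx, cy = 661.5, 188.2          # detector centers can come back fractional
click_at(int(cx), int(cy))     # principle 7: pynput expects integer logical pixels

paste_text("明天见")            # principle 6: paste, don't type, CJK text
```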
## Safety Rules
1. **Full-screen search + window validation** — match on full screen, reject matches outside target app's window bounds
2. **App switch detection** — `click_component` checks frontmost app after every click
3. **No wrong-app learning** — validate frontmost app before learn
4. **Reject tiny templates** — <30×30 pixels produce false matches
5. **Never send screenshots to chat** — internal detection only
6. **NEVER quit the communication app** — if a dialog asks to quit apps (like CleanMyMac's "Quit All"), NEVER quit Discord/Telegram/WhatsApp or whatever channel you're communicating through. Instead: click "Ignore" to skip. Quitting the comms app disconnects you from the user.
7. **Watch for new dialogs/windows** — clicking a button may spawn a new dialog or window. After clicking, check if a new window appeared and handle it before continuing.
8. **Every click uses `click_and_record` or `click_component`** — never raw `click_at()`. Every click must record a state transition.
## Input Methods (platform_input.py)
```python
from platform_input import (
    click_at, paste_text, key_press, key_combo, screenshot,
    activate_app, get_clipboard, set_clipboard, mouse_right_click,
)
```
No cliclick. No osascript for input. pynput only.
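A hypothetical usage sketch; the argument conventions for `key_combo`/`key_press` are guesses, not documented signatures:
```python
from platform_input import (
    activate_app, key_combo, key_press, paste_text, screenshot,
)

activate_app("Safari")         # bring the target app to the front
key_combo("cmd", "l")          # focus the address bar (argument style assumed)
paste_text("united.com")       # paste is more reliable than typing
key_press("enter")
shot = screenshot()            # verify the result visually
```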
## File Structure
```
gui-agent/
├── SKILL.md # This file
├── skills/ # Sub-skills (read on demand)
├── scripts/
│ ├── agent.py # CLI entry point
│ ├── app_memory.py # Components, states, transitions, matching
│ ├── platform_input.py # Cross-platform input (pynput)
│ ├── ui_detector.py # GPA-GUI-Detector + OCR detection
│ └── template_match.py # Legacy template matching
├── memory/ # Visual memory (gitignored but ESSENTIAL)
│ ├── apps/
│ │ ├── <appname>/
│ │ │ ├── meta.json # Metadata (detect_count, forget_threshold)
│ │ │ ├── components.json # Component registry + activity tracking
│ │ │ ├── states.json # States defined by component sets
│ │ │ ├── transitions.json # State transitions (dict, deduped)
│ │ │ ├── components/ # Cropped UI element templates
│ │ │ └── pages/ # Page screenshots
│ │ └── chromium/ # Browser example
│ │ ├── meta.json, components.json, states.json, transitions.json
│ │ ├── components/
│ │ ├── pages/
│ │ └── sites/ # ⭐ Each website = same 4-file structure
│ │ ├── united.com/
│ │ ├── delta.com/
│ │ └── ...
└── README.md
```