◈ Technical Reference · 2025

Building Computer Use Agents

A complete guide to designing, architecting, and deploying AI agents that perceive and interact with real computer interfaces.


What is a Computer Use Agent?

A Computer Use Agent (CUA) is an AI system that can observe a computer’s screen (via screenshots or accessibility trees), understand what it sees, and take actions — clicking, typing, scrolling, launching programs — just like a human operator would. CUAs combine a vision-language model for perception with a tool-use layer for actuation, wrapped in an action-observation loop that continues until the task is complete.

Unlike traditional RPA bots tied to fixed UI selectors, CUAs generalize across interfaces by reasoning about what they see, making them robust to UI changes and capable of handling novel tasks described in plain language.

Agent Action Loop

🎯
User Task
Natural language instruction received
📸
Capture Screen
Screenshot or accessibility tree
👁️
Perceive & Plan
Vision-LLM interprets state & reasons next action
◀ LOOP BACK
🤔
Task Complete?
Check goal condition
NO
🖱️
Execute Action
click · type · scroll · keypress · drag
🔄
Observe Result
New screenshot captured after action
YES
Task Done
Return result to user
📤
Report Output
Summary, artifacts, or final state

↑ The NO branch loops back to “Capture Screen” until the goal is satisfied or max steps reached


System Architecture

🧠

Vision-Language Model

The “brain” of the agent. A multimodal LLM (e.g., Claude, GPT-4o) that receives screenshots and task descriptions, reasons about the current state, and outputs the next action to take.

🖥️

Screen Capture Layer

Captures the current desktop state as a screenshot or structured accessibility tree. Provides the visual observation that the VLM uses for reasoning. Runs after every action.

🛠️

Action Executor

Translates model outputs into real OS-level actions: mouse clicks at (x,y), keyboard input, scrolling, window management. Can use pyautogui, xdotool, Playwright, or OS APIs.

📝

Memory & Context

Maintains a rolling history of past actions and observations. Prevents the agent from repeating mistakes and provides context for multi-step task execution across many turns.
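The rolling-history idea can be sketched as follows; `trim_history`, `MAX_TURNS`, and the placeholder text are illustrative choices, not part of any SDK:

```python
# Rolling conversation memory: keep recent turns verbatim, but replace
# older screenshot payloads with a short text placeholder so the model
# retains what happened without carrying the token cost of old pixels.
MAX_TURNS = 10  # assumed budget: how many recent turns to keep as-is

def trim_history(messages: list[dict]) -> list[dict]:
    """Return a cheaper copy of the message history: older multimodal
    turns are flattened to text, the most recent turns are untouched."""
    recent = messages[-MAX_TURNS:]
    older = []
    for msg in messages[:-MAX_TURNS]:
        content = msg["content"]
        if isinstance(content, list):  # multimodal turn with a screenshot
            texts = [b["text"] for b in content if b.get("type") == "text"]
            content = " ".join(texts) + " [screenshot omitted]"
        older.append({"role": msg["role"], "content": content})
    return older + recent
```

A real agent would call this before every model request, keeping the context window bounded no matter how long the task runs.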

🔒

Safety & Guardrails

Validates actions before execution. Restricts access to sensitive OS areas, enforces action budgets, prevents runaway loops, and supports human-in-the-loop confirmation for risky steps.
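A minimal guardrail check might look like this; the rule set and names (`ALLOWED_TYPES`, `BLOCKED_KEYS`, `MAX_ACTIONS`) are assumptions for illustration, and a real deployment needs far stricter policies:

```python
# Guardrail sketch: validate a proposed action before executing it.
ALLOWED_TYPES = {"click", "type", "scroll", "key"}
BLOCKED_KEYS = {("ctrl", "alt", "delete"), ("command", "q")}  # assumed deny-list
MAX_ACTIONS = 50  # hard action budget per task

class GuardrailError(Exception):
    """Raised instead of executing a risky or out-of-policy action."""

def validate_action(action: dict, actions_taken: int) -> None:
    if actions_taken >= MAX_ACTIONS:
        raise GuardrailError("action budget exhausted")
    if action.get("type") not in ALLOWED_TYPES:
        raise GuardrailError(f"unknown action type: {action.get('type')!r}")
    if action.get("type") == "key" and tuple(action.get("keys", [])) in BLOCKED_KEYS:
        raise GuardrailError("blocked key combination")
```

Calling this immediately before the executor gives a single choke point where budgets, allow-lists, and human-confirmation hooks can live.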

📊

Task Planner

Optional high-level planner that decomposes complex goals into subtasks, tracks progress, and re-plans when subtasks fail — enabling more reliable long-horizon task completion.
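One way to sketch the planner's bookkeeping; the `Subtask` structure and the retry policy are illustrative assumptions, not a prescribed design:

```python
# Task-planner sketch: track subtasks and re-queue failures
# until a per-subtask attempt budget runs out.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    description: str
    attempts: int = 0
    done: bool = False

@dataclass
class Planner:
    subtasks: list = field(default_factory=list)
    max_attempts: int = 3  # assumed retry budget per subtask

    def next_subtask(self):
        """Return the first unfinished subtask still within budget."""
        for st in self.subtasks:
            if not st.done and st.attempts < self.max_attempts:
                return st
        return None  # plan exhausted or complete

    def report(self, subtask: Subtask, success: bool):
        """Record an attempt; failed subtasks stay eligible for retry."""
        subtask.attempts += 1
        subtask.done = success
```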


Real-World Use Cases

📧

Email Triage & Response

  1. Agent opens mail client, takes screenshot
  2. VLM identifies unread emails and categorizes by urgency
  3. Clicks into each email, reads content
  4. Drafts and sends replies based on context
  5. Archives or labels processed messages
📊

Data Entry Automation

  1. User provides CSV and target web form URL
  2. Agent opens browser, navigates to form
  3. Reads field labels from screenshot
  4. Types values, handles dropdowns & date pickers
  5. Submits form, repeats for each row
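The outer loop of this workflow can be sketched as one agent run per CSV row; `fill_form_row` is a hypothetical stand-in for the CUA call that would actually drive the browser:

```python
# Data-entry driver sketch: iterate CSV rows, delegate each to the agent.
import csv, io

def fill_form_row(url: str, row: dict) -> bool:
    """Hypothetical placeholder: a real version would open `url`,
    read field labels from a screenshot, type each value from `row`,
    and submit. Here we just pretend-validate that no field is empty."""
    return all(v.strip() for v in row.values())

def run_data_entry(csv_text: str, url: str) -> int:
    """Submit the form once per CSV row; return the success count."""
    submitted = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if fill_form_row(url, row):
            submitted += 1
    return submitted
```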
🧪

UI Testing Agent

  1. Receives test scenario in plain English
  2. Launches app and navigates UI
  3. Performs actions: login, fill forms, click buttons
  4. Compares screenshots to expected states
  5. Reports bugs with annotated screenshots
🔍

Research Assistant

  1. User asks: “Find pricing of top 5 CRM tools”
  2. Agent opens browser, searches each company
  3. Navigates to pricing pages, reads content
  4. Compiles results into a structured table
  5. Returns summary with source links
💻

Developer Workflow Agent

  1. Opens terminal, pulls latest code
  2. Runs test suite, reads failure output
  3. Opens relevant files in editor
  4. Makes targeted edits, reruns tests
  5. Commits and pushes on success
🗓️

Calendar Scheduling

  1. “Schedule a 30-min meeting with the team next week”
  2. Agent opens the calendar app and captures a screenshot
  3. Checks availability across attendees
  4. Creates event, fills in title/location/notes
  5. Sends invitations, confirms bookings

Minimal CUA Loop (Python)

PYTHON · CUA CORE LOOP
# Minimal Computer Use Agent loop using Anthropic + pyautogui
import anthropic, pyautogui, base64, io, json
from PIL import ImageGrab

client = anthropic.Anthropic()
MAX_STEPS = 20

def capture_screen() -> str:
    """Take a screenshot and return it as base64-encoded PNG."""
    img = ImageGrab.grab()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def execute_action(action: dict):
    """Map model output → real OS action."""
    t = action["type"]
    if   t == "click":  pyautogui.click(action["x"], action["y"])
    elif t == "type":   pyautogui.write(action["text"], interval=0.05)
    elif t == "scroll": pyautogui.scroll(action.get("delta", -3))
    elif t == "key":    pyautogui.hotkey(*action["keys"])

def run_agent(task: str):
    messages = []
    for step in range(MAX_STEPS):
        # 1. Capture current screen state
        screenshot = capture_screen()

        # 2. Ask the VLM what action to take next
        messages.append({"role": "user", "content": [{
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": screenshot}
        }, {"type": "text", "text": f"Task: {task}\nWhat is the next action? Reply with JSON: {{type, x?, y?, text?, keys?}}\nOr reply DONE if complete."}]})

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=512,
            messages=messages
        )

        reply = response.content[0].text.strip()
        messages.append({"role": "assistant", "content": reply})

        # 3. Check if done
        if "DONE" in reply:
            print("✅ Task complete")
            break

        # 4. Parse and execute the action
        action = json.loads(reply)
        execute_action(action)
    else:
        print("⚠️ Reached MAX_STEPS before the task completed")

run_agent("Open Chrome, go to wikipedia.org, search for 'Turing machine'")

Tools & Frameworks

🤖 Anthropic Claude API
🌐 Microsoft Playwright
🖱️ PyAutoGUI
🖥️ xdotool (Linux)
🍎 AppleScript / Shortcuts
🪟 Windows UIAutomation
🧩 LangChain / LangGraph
📦 OpenAI Swarm / AgentKit
🔬 OpenAdapt / OSWorld
🌳 Selenium WebDriver
🧠 SOM (Set-of-Marks)
🔄 Temporal Workflows

Key Challenges

🎯
Grounding Accuracy
Correctly mapping visual elements to (x,y) coordinates. Small errors cascade. SOM prompting and element detection help.
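One common grounding pitfall is forgetting the scale factor when the screenshot sent to the model was downscaled; a minimal fix, with illustrative sizes, looks like this:

```python
# Grounding sketch: map a coordinate predicted on a resized screenshot
# back to native screen resolution. Forgetting this rescale is a common
# source of off-by-hundreds clicks.
def to_screen_coords(x: int, y: int,
                     model_size: tuple, screen_size: tuple) -> tuple:
    """Rescale (x, y) from the image the model saw to the real screen."""
    mw, mh = model_size
    sw, sh = screen_size
    return round(x * sw / mw), round(y * sh / mh)
```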
⚡
Latency & Cost
Every step requires a VLM call with a large image. Optimize by compressing screenshots and caching repeated observations.
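Screenshot compression can be sketched with Pillow; the target width and JPEG quality below are illustrative knobs, not recommended values:

```python
# Cost-reduction sketch: downscale and JPEG-compress each screenshot
# before sending it to the model.
import base64, io
from PIL import Image

def compress_screenshot(img: Image.Image,
                        max_width: int = 1280,
                        quality: int = 60) -> str:
    """Resize to max_width (keeping aspect ratio), re-encode as JPEG,
    and return base64 for the API payload."""
    if img.width > max_width:
        new_h = round(img.height * max_width / img.width)
        img = img.resize((max_width, new_h))
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode()
```

Swapping PNG for quality-60 JPEG at reduced resolution typically shrinks the payload by an order of magnitude, at some risk to small-text legibility.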
🔁
Loop Detection
Agents can get stuck repeating the same failing action. Track action history and implement escape heuristics.
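A minimal loop detector over the action history might look like this; the window size is an assumed threshold:

```python
# Loop-detection sketch: flag when the agent repeats the identical
# action several times in a row.
from collections import deque

class LoopDetector:
    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)

    def record(self, action: dict) -> bool:
        """Record an action; return True if the last `window` actions
        are identical, a signal to re-plan or escalate to a human."""
        key = tuple(sorted(action.items()))
        self.recent.append(key)
        return (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1)
```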
🛡️
Safety & Permissions
Agents operating on real machines can cause irreversible damage. Sandboxing, confirmations, and action whitelists are essential.
🔄
Dynamic UIs
Websites and apps change. Visual reasoning is more robust than hard-coded selectors, but still requires tolerance for variation.
🧪
Evaluation
Measuring success is hard. Use benchmarks like OSWorld, WebArena, or build task-specific golden trajectories for regression testing.
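The golden-trajectory idea can be sketched as a tolerant trace comparison; matching on action type plus rounded coordinates is an illustrative policy, not a standard:

```python
# Regression-test sketch: compare an agent's action trace against a
# stored "golden" trajectory for the same task.
def actions_match(actual: dict, expected: dict, tol: int = 10) -> bool:
    """Same action type, and coordinates within `tol` pixels."""
    if actual.get("type") != expected.get("type"):
        return False
    for k in ("x", "y"):
        if k in expected and abs(actual.get(k, 0) - expected[k]) > tol:
            return False
    return True

def trajectory_passes(trace: list, golden: list) -> bool:
    """A run passes if it reproduces every golden step, in order."""
    return (len(trace) == len(golden)
            and all(actions_match(a, e) for a, e in zip(trace, golden)))
```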

COMPUTER USE AGENTS · REFERENCE GUIDE · 2025 — Built with Claude
