Skill

Automate Computer Tasks with Vision Agents

Build AI agents that control computers through vision-based screen interaction-viewing, clicking, typing-using Anthropic Computer Use, OpenAI Operator, or

Works with githubdockerubuntu

46
Spark score
out of 100
Updated 2 days ago
Version 13.1.0

Add to Favorites

Why it matters

Build AI agents that can perceive, reason about, and interact with computer interfaces like humans, automating complex desktop tasks through a vision-based action loop.

Outcomes

What it gets done

01

Develop agents that view screens, move cursors, and perform clicks/typing.

02

Implement the Perception-Reasoning-Action loop for iterative task completion.

03

Integrate vision models with desktop control for autonomous operation.

04

Understand and mitigate security risks through sandboxing and isolation.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/ag-computer-use-agents | bash

Capabilities

What this skill does

Automate the OS

Runs system commands and automates desktop tasks.

Drive a browser

Controls a real browser to automate web workflows.

Scrape

Fetches and parses content from web pages.

Search the web

Searches the web and retrieves relevant sources.

Generate code

Writes source code or scripts from a description.

Overview

Computer Use Agents

What it does

Perception-reasoning-action loop architecture for vision-based computer control agents with Docker sandboxing

How it connects

When you need AI agents to automate desktop tasks by viewing screens and executing mouse/keyboard actions while maintaining strict security isolation

Source README

Computer Use Agents

Build AI agents that interact with computers like humans do - viewing screens,
moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer
Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on
sandboxing, security, and handling the unique challenges of vision-based control.

Patterns

Perception-Reasoning-Action Loop

The fundamental architecture of computer use agents: observe screen,
reason about next action, execute action, repeat. This loop integrates
vision models with action execution through an iterative pipeline.

Key components:

  1. PERCEPTION: Screenshot captures current screen state
  2. REASONING: Vision-language model analyzes and plans
  3. ACTION: Execute mouse/keyboard operations
  4. FEEDBACK: Observe result, continue or correct

Critical insight: Vision agents are completely still during "thinking"
phase (1-5 seconds), creating a detectable pause pattern.

When to use: Building any computer use agent from scratch,Integrating vision models with desktop control,Understanding agent behavior patterns

from anthropic import Anthropic
from PIL import Image
import base64
import pyautogui
import time

class ComputerUseAgent:
"""
Perception-Reasoning-Action loop implementation.
Based on Anthropic Computer Use patterns.
"""

def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
    self.client = client
    self.model = model
    self.max_steps = 50  # Prevent runaway loops
    self.action_delay = 0.5  # Seconds between actions

def capture_screenshot(self) -> str:
    """Capture screen and return base64 encoded image."""
    screenshot = pyautogui.screenshot()
    # Resize for token efficiency (1280x800 is good balance)
    screenshot = screenshot.resize((1280, 800), Image.LANCZOS)

    import io
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

def execute_action(self, action: dict) -> dict:
    """Execute mouse/keyboard action on the computer."""
    action_type = action.get("type")

    if action_type == "click":
        x, y = action["x"], action["y"]
        button = action.get("button", "left")
        pyautogui.click(x, y, button=button)
        return {"success": True, "action": f"clicked at ({x}, {y})"}

    elif action_type == "type":
        text = action["text"]
        pyautogui.typewrite(text, interval=0.02)
        return {"success": True, "action": f"typed {len(text)} chars"}

    elif action_type == "key":
        key = action["key"]
        pyautogui.press(key)
        return {"success": True, "action": f"pressed {key}"}

    elif action_type == "scroll":
        direction = action.get("direction", "down")
        amount = action.get("amount", 3)
        scroll = -amount if direction == "down" else amount
        pyautogui.scroll(scroll)
        return {"success": True, "action": f"scrolled {direction}"}

    elif action_type == "move":
        x, y = action["x"], action["y"]
        pyautogui.moveTo(x, y)
        return {"success": True, "action": f"moved to ({x}, {y})"}

    else:
        return {"success": False, "error": f"Unknown action: {action_type}"}

def run(self, task: str) -> dict:
    """
    Run perception-reasoning-action loop until task complete.

    The loop:
    1. Screenshot current state
    2. Send to vision model with task context
    3. Parse action from response
    4. Execute action
    5. Repeat until done or max steps
    """
    messages = []
    step_count = 0

    system_prompt = """You are a computer use agent. You can see the screen
    and control mouse/keyboard.

    Available actions (respond with JSON):
    - {"type": "click", "x": 100, "y": 200, "button": "left"}
    - {"type": "type", "text": "hello world"}
    - {"type": "key", "key": "enter"}
    - {"type": "scroll", "direction": "down", "amount": 3}
    - {"type": "done", "result": "task completed successfully"}

    Always respond with ONLY a JSON action object.
    Be precise with coordinates - click exactly where needed.
    If you see an error, try to recover.
    """

    while step_count < self.max_steps:
        step_count += 1

        # 1. PERCEPTION: Capture current screen
        screenshot_b64 = self.capture_screenshot()

        # 2. REASONING: Send to vision model
        user_content = [
            {"type": "text", "text": f"Task: {task}\n\nStep {step_count}. What action should I take?"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64
            }}
        ]

        messages.append({"role": "user", "content": user_content})

        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            system=system_prompt,
            messages=messages
        )

        assistant_message = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_message})

        # 3. Parse action from response
        import json
        try:
            action = json.loads(assistant_message)
        except json.JSONDecodeError:
            # Try to extract JSON from response
            import re
            match = re.search(r'\{[^}]+\}', assistant_message)
            if match:
                action = json.loads(match.group())
            else:
                continue

        # Check if done
        if action.get("type") == "done":
            return {
                "success": True,
                "result": action.get("result"),
                "steps": step_count
            }

        # 4. ACTION: Execute
        result = self.execute_action(action)

        # Small delay for UI to update
        time.sleep(self.action_delay)

    return {
        "success": False,
        "error": "Max steps reached",
        "steps": step_count
    }

Usage

agent = ComputerUseAgent(Anthropic())
result = agent.run("Open Chrome and search for 'weather today'")

Anti_patterns

  • Running without step limits (infinite loops)
  • No delay between actions (UI can't keep up)
  • Screenshots at full resolution (token explosion)
  • Ignoring action failures (no recovery)

Sandboxed Environment Pattern

Computer use agents MUST run in isolated, sandboxed environments.
Never give agents direct access to your main system - the security
risks are too high. Use Docker containers with virtual desktops.

Key isolation requirements:

  1. NETWORK: Restrict to necessary endpoints only
  2. FILESYSTEM: Read-only or scoped to temp directories
  3. CREDENTIALS: No access to host credentials
  4. SYSCALLS: Filter dangerous system calls
  5. RESOURCES: Limit CPU, memory, time

The goal is "blast radius minimization" - if the agent goes wrong,
damage is contained to the sandbox.

When to use: Deploying any computer use agent,Testing agent behavior safely,Running untrusted automation tasks

Dockerfile for sandboxed computer use environment

Based on Anthropic's reference implementation pattern

FROM ubuntu:22.04

Install desktop environment

RUN apt-get update && apt-get install -y
xvfb
x11vnc
fluxbox
xterm
firefox
python3
python3-pip
supervisor

Security: Create non-root user

RUN useradd -m -s /bin/bash agent &&
mkdir -p /home/agent/.vnc

Install Python dependencies

COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt

Security: Drop capabilities

RUN apt-get install -y --no-install-recommends libcap2-bin &&
setcap -r /usr/bin/python3 || true

Copy agent code

COPY --chown=agent:agent . /app
WORKDIR /app

Supervisor config for virtual display + VNC

COPY supervisord.conf /etc/supervisor/conf.d/

Expose VNC port only (not desktop directly)

EXPOSE 5900

Run as non-root

USER agent

CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]


docker-compose.yml with security constraints

version: '3.8'

services:
computer-use-agent:
build: .
ports:
- "5900:5900" # VNC for observation
- "8080:8080" # API for control

# Security constraints
security_opt:
  - no-new-privileges:true
  - seccomp:seccomp-profile.json

# Resource limits
deploy:
  resources:
    limits:
      cpus: '2'
      memory: 4G
    reservations:
      cpus: '0.5'
      memory: 1G

# Network isolation
networks:
  - agent-network

# No access to host filesystem
volumes:
  - agent-tmp:/tmp

# Read-only root filesystem
read_only: true
tmpfs:
  - /run
  - /var/run

# Environment
environment:
  - DISPLAY=:99
  - NO_PROXY=localhost

networks:
agent-network:
driver: bridge
internal: true # No internet by default

volumes:
agent-tmp:


Python wrapper with additional runtime sandboxing

import subprocess
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class SandboxConfig:
"""Configuration for agent sandbox."""
network_allowed: list[str] = None # Allowed domains
max_runtime_seconds: int = 300
max_memory_mb: int = 2048
allow_downloads: bool = False
allow_clipboard: bool = False

class SandboxedAgent:
"""
Run computer use agent in Docker sandbox.
"""

def __init__(self, config: SandboxConfig):
    self.config = config
    self.container_id: Optional[str] = None

def start(self):
    """Start sandboxed environment."""
    # Build network rules
    network_rules = ""
    if self.config.network_allowed:
        for domain in self.config.network_allowed:
            network_rules += f"--add-host={domain}:$(dig +short {domain}) "
    else:
        network_rules = "--network=none"

    cmd = f"""
    docker run -d \
        --name computer-use-sandbox-$$ \
        --security-opt no-new-privileges \
        --cap-drop ALL \
        --memory {self.config.max_memory_mb}m \
        --cpus 2 \
        --read-only \
        --tmpfs /tmp \
        {network_rules} \
        computer-use-agent:latest
    """

    result = subprocess.run(cmd, shell=True, capture_output=True)
    self.container_id = result.stdout.decode().strip()

    # Set up kill timer
    subprocess.Popen([
        "sh", "-c",
        f"sleep {self.config.max_runtime_seconds} && docker kill {self.container_id}"
    ])

    return self.container_id

def execute_task(self, task: str) -> dict:
    """Execute task in sandbox."""
    if not self.container_id:
        self.start()

    # Send task to agent via API
    import requests
    response = requests.post(
        f"http://localhost:8080/task",
        json={"task": task},
        timeout=self.config.max_runtime_seconds
    )

    return response.json()

def stop(self):
    """Stop and remove sandbox."""
    if self.container_id:
        subprocess.run(f"docker rm -f {self.container_id}", shell=True)
        self.container_id = None

Anti_patterns

  • Running agents on host system directly
  • Giving sandbox full network access
  • Running as root in container
  • No resource limits (denial of service)
  • Persistent storage (data can leak between runs)

Anthropic Computer Use Implementation

Official implementation pattern using Claude's computer use capability.
Claude 3.5 Sonnet was the first frontier model to offer computer use.
Claude Opus 4.5 is now the "best model in the world for computer use."

Key capabilities:

  • screenshot: Capture current screen state
  • mouse: Click, move, drag operations
  • keyboard: Type text, press keys
  • bash: Run shell commands
  • text_editor: View and edit files

Tool versions:

  • computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
  • computer_20250124 (All other models): Standard capabilities

Critical limitation: "Some UI elements (like dropdowns and scrollbars)
might be tricky for Claude to manipulate" - Anthropic docs

When to use: Building production computer use agents,Need highest quality vision understanding,Full desktop control (not just browser)

from anthropic import Anthropic
from anthropic.types.beta import (
BetaToolComputerUse20241022,
BetaToolBash20241022,
BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io

class AnthropicComputerUse:
"""
Official Anthropic Computer Use implementation.

Requires:
- Docker container with virtual display
- VNC for viewing agent actions
- Proper tool implementations
"""

def __init__(self):
    self.client = Anthropic()
    self.model = "claude-sonnet-4-20250514"  # Best for computer use
    self.screen_size = (1280, 800)

def get_tools(self) -> list:
    """Define computer use tools."""
    return [
        BetaToolComputerUse20241022(
            type="computer_20241022",
            name="computer",
            display_width_px=self.screen_size[0],
            display_height_px=self.screen_size[1],
        ),
        BetaToolBash20241022(
            type="bash_20241022",
            name="bash",
        ),
        BetaToolTextEditor20241022(
            type="text_editor_20241022",
            name="str_replace_editor",
        ),
    ]

def execute_tool(self, name: str, input: dict) -> dict:
    """Execute a tool and return result."""

    if name == "computer":
        return self._handle_computer_action(input)
    elif name == "bash":
        return self._handle_bash(input)
    elif name == "str_replace_editor":
        return self._handle_editor(input)
    else:
        return {"error": f"Unknown tool: {name}"}

def _handle_computer_action(self, input: dict) -> dict:
    """Handle computer control actions."""
    action = input.get("action")

    if action == "screenshot":
        # Capture via xdotool/scrot
        subprocess.run(["scrot", "/tmp/screenshot.png"])

        with open("/tmp/screenshot.png", "rb") as f:
            img_data = f.read()

        # Resize for efficiency
        img = Image.open(io.BytesIO(img_data))
        img = img.resize(self.screen_size, Image.LANCZOS)

        buffer = io.BytesIO()
        img.save(buffer, format="PNG")

        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(buffer.getvalue()).decode()
            }
        }

    elif action == "mouse_move":
        x, y = input.get("coordinate", [0, 0])
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        return {"success": True}

    elif action == "left_click":
        subprocess.run(["xdotool", "click", "1"])
        return {"success": True}

    elif action == "right_click":
        subprocess.run(["xdotool", "click", "3"])
        return {"success": True}

    elif action == "double_click":
        subprocess.run(["xdotool", "click", "--repeat", "2", "1"])
        return {"success": True}

    elif action == "type":
        text = input.get("text", "")
        # Use xdotool type with delay for reliability
        subprocess.run(["xdotool", "type", "--delay", "50", text])
        return {"success": True}

    elif action == "key":
        key = input.get("key", "")
        # Map common key names
        key_map = {
            "return": "Return",
            "enter": "Return",
            "tab": "Tab",
            "escape": "Escape",
            "backspace": "BackSpace",
        }
        xdotool_key = key_map.get(key.lower(), key)
        subprocess.run(["xdotool", "key", xdotool_key])
        return {"success": True}

    elif action == "scroll":
        direction = input.get("direction", "down")
        amount = input.get("amount", 3)
        button = "5" if direction == "down" else "4"
        for _ in range(amount):
            subprocess.run(["xdotool", "click", button])
        return {"success": True}

    return {"error": f"Unknown action: {action}"}

def _handle_bash(self, input: dict) -> dict:
    """Execute bash command."""
    command = input.get("command", "")

    # Security: Sanitize and limit commands
    dangerous_patterns = ["rm -rf", "mkfs", "dd if=", "> /dev/"]
    for pattern in dangerous_patterns:
        if pattern in command:
            return {"error": "Dangerous command blocked"}

    try:
        result = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=30
        )
        return {
            "stdout": result.stdout[:10000],  # Limit output
            "stderr": result.stderr[:1000],
            "returncode": result.returncode
        }
    except subprocess.TimeoutExpired:
        return {"error": "Command timed out"}

def _handle_editor(self, input: dict) -> dict:
    """Handle text editor operations."""
    command = input.get("command")
    path = input.get("path")

    if command == "view":
        try:
            with open(path, "r") as f:
                content = f.read()
            return {"content": content[:50000]}  # Limit size
        except Exception as e:
            return {"error": str(e)}

    elif command == "str_replace":
        old_str = input.get("old_str")
        new_str = input.get("new_str")
        try:
            with open(path, "r") as f:
                content = f.read()
            if old_str not in content:
                return {"error": "old_str not found in file"}
            content = content.replace(old_str, new_str, 1)
            with open(path, "w") as f:
                f.write(content)
            return {"success": True}
        except Exception as e:
            return {"error": str(e)}

    return {"error": f"Unknown editor command: {command}"}

def run_task(self, task: str, max_steps: int = 50) -> dict:
    """Run computer use task with agentic loop."""
    messages = [{"role": "user", "content": task}]
    tools = self.get_tools()

    for step in range(max_steps):
        response = self.client.beta.messages.create(
            model=self.model,
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"]
        )

        # Check for completion
        if response.stop_reason == "end_turn":
            return {
                "success": True,
                "result": response.content[0].text if response.content else "",
                "steps": step + 1
            }

        # Handle tool use
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})

            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = self.execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

            messages.append({"role": "user", "content": tool_results})

    return {"success": False, "error": "Max steps reached"}

Anti_patterns

  • Not using betas=['computer-use-2024-10-22'] flag
  • Full resolution screenshots (wasteful)
  • No command sanitization for bash tool
  • Unbounded execution time

Browser-Use Pattern (Playwright-based)

For browser-only automation, using structured DOM access is more efficient
than pixel-based computer use. Playwright MCP allows LLMs to control
browsers using accessibility snapshots rather than screenshots.

Advantages over vision-based:

  • Faster: No image processing required
  • Cheaper: Text tokens vs image tokens
  • More precise: Direct element targeting
  • More reliable: No coordinate drift

When to use vision vs structured:

  • Vision: Desktop apps, complex UIs, visual verification
  • Structured: Web automation, form filling, data extraction

When to use: Browser-only automation tasks,Form filling and web interactions,When speed and cost matter more than visual understanding

from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import Optional
import asyncio

@dataclass
class BrowserAction:
"""Structured browser action."""
action: str # click, type, navigate, scroll, extract
selector: Optional[str] = None
text: Optional[str] = None
url: Optional[str] = None

class BrowserUseAgent:
"""
Browser automation using Playwright with structured commands.
More efficient than pixel-based for web tasks.
"""

def __init__(self):
    self.browser = None
    self.page = None

async def start(self, headless: bool = True):
    """Start browser session."""
    self.playwright = await async_playwright().start()
    self.browser = await self.playwright.chromium.launch(headless=headless)
    self.page = await self.browser.new_page()

async def get_page_snapshot(self) -> dict:
    """
    Get structured snapshot of page for LLM.
    Uses accessibility tree for efficiency.
    """
    # Get accessibility tree
    snapshot = await self.page.accessibility.snapshot()

    # Get simplified DOM info
    elements = await self.page.evaluate('''() => {
        const interactable = [];
        const selector = 'a, button, input, select, textarea, [role="button"]';
        document.querySelectorAll(selector).forEach((el, i) => {
            const rect = el.getBoundingClientRect();
            if (rect.width > 0 && rect.height > 0) {
                interactable.push({
                    index: i,
                    tag: el.tagName.toLowerCase(),
                    text: el.textContent?.trim().slice(0, 100),
                    type: el.type,
                    placeholder: el.placeholder,
                    name: el.name,
                    id: el.id,
                    class: el.className
                });
            }
        });
        return interactable;
    }''')

    return {
        "url": self.page.url,
        "title": await self.page.title(),
        "accessibility_tree": snapshot,
        "interactable_elements": elements[:50]  # Limit for token efficiency
    }

async def execute_action(self, action: BrowserAction) -> dict:
    """Execute structured browser action."""

    try:
        if action.action == "navigate":
            await self.page.goto(action.url, wait_until="domcontentloaded")
            return {"success": True, "url": self.page.url}

        elif action.action == "click":
            await self.page.click(action.selector, timeout=5000)
            await self.page.wait_for_load_state("networkidle", timeout=5000)
            return {"success": True}

        elif action.action == "type":
            await self.page.fill(action.selector, action.text)
            return {"success": True}

        elif action.action == "scroll":
            direction = action.text or "down"
            distance = 500 if direction == "down" else -500
            await self.page.evaluate(f"window.scrollBy(0, {distance})")
            return {"success": True}

        elif action.action == "extract":
            # Extract text content
            if action.selector:
                text = await self.page.text_content(action.selector)
            else:
                text = await self.page.text_content("body")
            return {"success": True, "text": text[:5000]}

        elif action.action == "screenshot":
            # Fall back to vision when needed
            screenshot = await self.page.screenshot(type="png")
            import base64
            return {
                "success": True,
                "image": base64.b64encode(screenshot).decode()
            }

    except Exception as e:
        return {"success": False, "error": str(e)}

    return {"success": False, "error": f"Unknown action: {action.action}"}

async def run_with_llm(self, task: str, llm_client, max_steps: int = 20):
    """
    Run browser task with LLM decision making.
    Uses structured DOM instead of screenshots.
    """

    system_prompt = """You are a browser automation agent. You receive
    page snapshots with interactable elements and decide actions.

    Respond with JSON action:
    - {"action": "navigate", "url": "https://..."}
    - {"action": "click", "selector": "button.submit"}
    - {"action": "type", "selector": "input[name='email']", "text": "..."}
    - {"action": "scroll", "text": "down"}
    - {"action": "extract", "selector": ".results"}
    - {"action": "done", "result": "task completed"}

    Use CSS selectors based on the element info provided.
    Prefer id > name > class > text content for selectors.
    """

    messages = []

    for step in range(max_steps):
        # Get current page state
        snapshot = await self.get_page_snapshot()

        user_message = f"""Task: {task}

        Current page:
        URL: {snapshot['url']}
        Title: {snapshot['title']}

        Interactable elements:
        {snapshot['interactable_elements']}

        What action should I take?"""

        messages.append({"role": "user", "content": user_message})

        # Get LLM decision
        response = llm_client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=system_prompt,
            messages=messages
        )

        assistant_text = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_text})

        # Parse and execute
        import json
        action_dict = json.loads(assistant_text)

        if action_dict.get("action") == "done":
            return {"success": True, "result": action_dict.get("result")}

        action = BrowserAction(**action_dict)
        result = await self.execute_action(action)

        if not result.get("success"):
            messages.append({
                "role": "user",
                "content": f"Action failed: {result.get('error')}"
            })

        await asyncio.sleep(0.5)  # Rate limit

    return {"success": False, "error": "Max steps reached"}

async def close(self):
    """Clean up browser."""
    if self.browser:
        await self.browser.close()
    if hasattr(self, 'playwright'):
        await self.playwright.stop()

Usage

async def main():
agent = BrowserUseAgent()
await agent.start(headless=False)

from anthropic import Anthropic
result = await agent.run_with_llm(
    "Go to weather.com and find the weather for New York",
    Anthropic()
)

print(result)
await agent.close()

asyncio.run(main())

Anti_patterns

  • Using screenshots when DOM access works
  • Not waiting for page loads
  • Hardcoded selectors that break
  • No error recovery for stale elements

User Confirmation Pattern

For sensitive actions, agents should pause and ask for human confirmation.
"ChatGPT agent also pauses and asks for confirmation prior to taking
sensitive steps such as completing a purchase."

Sensitivity levels:

  1. LOW: Navigation, reading (auto-approve)
  2. MEDIUM: Form filling, clicking (log, maybe confirm)
  3. HIGH: Purchases, authentication, file operations (always confirm)
  4. CRITICAL: Credential entry, financial transactions (confirm + review)

When to use: Actions with real-world consequences,Financial transactions,Authentication flows,File modifications

from enum import Enum
from dataclasses import dataclass
from typing import Callable, Optional
import asyncio

class ActionSeverity(Enum):
LOW = "low" # Auto-approve
MEDIUM = "medium" # Log, optional confirm
HIGH = "high" # Always confirm
CRITICAL = "critical" # Confirm + review details

@dataclass
class SensitiveAction:
"""Action that may need user confirmation."""
action_type: str
description: str
severity: ActionSeverity
details: dict

class ConfirmationGate:
"""
Gate sensitive actions through user confirmation.
"""

# Action type -> severity mapping
ACTION_SEVERITY = {
    # LOW - auto-approve
    "navigate": ActionSeverity.LOW,
    "scroll": ActionSeverity.LOW,
    "read": ActionSeverity.LOW,
    "screenshot": ActionSeverity.LOW,

    # MEDIUM - log and maybe confirm
    "click": ActionSeverity.MEDIUM,
    "type": ActionSeverity.MEDIUM,
    "search": ActionSeverity.MEDIUM,

    # HIGH - always confirm
    "download": ActionSeverity.HIGH,
    "submit_form": ActionSeverity.HIGH,
    "login": ActionSeverity.HIGH,
    "file_write": ActionSeverity.HIGH,

    # CRITICAL - confirm with full review
    "purchase": ActionSeverity.CRITICAL,
    "enter_password": ActionSeverity.CRITICAL,
    "enter_credit_card": ActionSeverity.CRITICAL,
    "send_money": ActionSeverity.CRITICAL,
    "delete": ActionSeverity.CRITICAL,
}

def __init__(
    self,
    confirm_callback: Callable[[SensitiveAction], bool] = None,
    auto_confirm_low: bool = True,
    auto_confirm_medium: bool = False
):
    self.confirm_callback = confirm_callback or self._default_confirm
    self.auto_confirm_low = auto_confirm_low
    self.auto_confirm_medium = auto_confirm_medium
    self.action_log = []

def _default_confirm(self, action: SensitiveAction) -> bool:
    """Default confirmation via CLI prompt."""
    print(f"\n{'='*60}")
    print(f"ACTION CONFIRMATION REQUIRED")
    print(f"{'='*60}")
    print(f"Type: {action.action_type}")
    print(f"Severity: {action.severity.value.upper()}")
    print(f"Description: {action.description}")
    print(f"Details: {action.details}")
    print(f"{'='*60}")

    while True:
        response = input("Allow this action? [y/n]: ").lower().strip()
        if response in ['y', 'yes']:
            return True
        elif response in ['n', 'no']:
            return False

def classify_action(self, action_type: str, context: dict) -> ActionSeverity:
    """Classify action severity, considering context."""
    base_severity = self.ACTION_SEVERITY.get(action_type, ActionSeverity.MEDIUM)

    # Escalate based on context
    if context.get("involves_credentials"):
        return ActionSeverity.CRITICAL
    if context.get("involves_money"):
        return ActionSeverity.CRITICAL
    if context.get("irreversible"):
        return max(base_severity, ActionSeverity.HIGH, key=lambda x: x.value)

    return base_severity

def check_action(
    self,
    action_type: str,
    description: str,
    details: dict = None
) -> tuple[bool, str]:
    """
    Check if action should proceed.
    Returns (approved, reason).
    """
    details = details or {}
    severity = self.classify_action(action_type, details)

    action = SensitiveAction(
        action_type=action_type,
        description=description,
        severity=severity,
        details=details
    )

    # Log all actions
    self.action_log.append({
        "action": action,
        "timestamp": __import__('datetime').datetime.now().isoformat()
    })

    # Auto-approve low severity
    if severity == ActionSeverity.LOW and self.auto_confirm_low:
        return True, "auto-approved (low severity)"

    # Maybe auto-approve medium
    if severity == ActionSeverity.MEDIUM and self.auto_confirm_medium:
        return True, "auto-approved (medium severity)"

    # Request confirmation
    approved = self.confirm_callback(action)

    if approved:
        return True, "user approved"
    else:
        return False, "user rejected"

class ConfirmedComputerUseAgent:
"""
Computer use agent with confirmation gates.
"""

def __init__(self, base_agent, confirmation_gate: ConfirmationGate):
    self.agent = base_agent
    self.gate = confirmation_gate

def execute_action(self, action: dict) -> dict:
    """Execute action with confirmation check."""
    action_type = action.get("type", "unknown")

    # Build description
    if action_type == "click":
        desc = f"Click at ({action.get('x')}, {action.get('y')})"
    elif action_type == "type":
        text = action.get('text', '')
        # Mask if looks like password
        if self._looks_sensitive(text):
            desc = f"Type sensitive text ({len(text)} chars)"
        else:
            desc = f"Type: {text[:50]}..."
    else:
        desc = f"Execute: {action_type}"

    # Context for severity classification
    context = {
        "involves_credentials": self._looks_sensitive(action.get("text", "")),
        "involves_money": self._mentions_money(action),
    }

    # Check with gate
    approved, reason = self.gate.check_action(
        action_type, desc, context
    )

    if not approved:
        return {
            "success": False,
            "error": f"Action blocked: {reason}",
            "action": action_type
        }

    # Execute if approved
    return self.agent.execute_action(action)

def _looks_sensitive(self, text: str) -> bool:
    """Check if text looks like sensitive data."""
    if not text:
        return False
    # Common patterns
    patterns = [
        r'\b\d{16}\b',  # Credit card
        r'\b\d{3,4}\b.*\b\d{3,4}\b',  # CVV-like
        r'password',
        r'secret',
        r'api.?key',
        r'token'
    ]
    import re
    return any(re.search(p, text.lower()) for p in patterns)

def _mentions_money(self, action: dict) -> bool:
    """Check if action involves money."""
    text = str(action)
    money_patterns = [
        r'\$\d+', r'pay', r'purchase', r'buy', r'checkout',
        r'credit', r'debit', r'invoice', r'payment'
    ]
    import re
    return any(re.search(p, text.lower()) for p in money_patterns)

Usage

gate = ConfirmationGate(
auto_confirm_low=True,
auto_confirm_medium=False # Confirm clicks, typing
)

agent = ConfirmedComputerUseAgent(base_agent, gate)
result = agent.execute_action({"type": "click", "x": 500, "y": 300})

Anti_patterns

  • Auto-approving all actions
  • Not logging rejected actions
  • Showing full passwords in confirmation
  • No timeout on confirmation (hangs forever)

Action Logging Pattern

All computer use agent actions should be logged for:

  1. Debugging failed automations
  2. Security auditing
  3. Reproducibility
  4. Compliance requirements

Log format should capture:

  • Timestamp
  • Action type and parameters
  • Screenshot before/after
  • Success/failure status
  • Model reasoning (if available)

When to use: Production computer use deployments,Debugging automation failures,Security-sensitive environments

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Any
import json
import os

@dataclass
class ActionLogEntry:
"""Single action log entry."""
timestamp: datetime
action_type: str
parameters: dict
success: bool
error: Optional[str] = None
screenshot_before: Optional[str] = None # Path to screenshot
screenshot_after: Optional[str] = None
model_reasoning: Optional[str] = None
duration_ms: Optional[int] = None

def to_dict(self) -> dict:
    return {
        "timestamp": self.timestamp.isoformat(),
        "action_type": self.action_type,
        "parameters": self._sanitize_params(self.parameters),
        "success": self.success,
        "error": self.error,
        "screenshot_before": self.screenshot_before,
        "screenshot_after": self.screenshot_after,
        "model_reasoning": self.model_reasoning,
        "duration_ms": self.duration_ms
    }

def _sanitize_params(self, params: dict) -> dict:
    """Remove sensitive data from params."""
    sanitized = {}
    sensitive_keys = ['password', 'secret', 'token', 'key', 'credit_card']

    for k, v in params.items():
        if any(s in k.lower() for s in sensitive_keys):
            sanitized[k] = "[REDACTED]"
        elif isinstance(v, str) and len(v) > 100:
            sanitized[k] = v[:100] + "...[truncated]"
        else:
            sanitized[k] = v

    return sanitized

@dataclass
class TaskSession:
"""A complete task execution session."""
session_id: str
task: str
start_time: datetime
end_time: Optional[datetime] = None
actions: list[ActionLogEntry] = field(default_factory=list)
success: bool = False
final_result: Optional[str] = None

class ActionLogger:
"""
Comprehensive action logging for computer use agents.
"""

def __init__(self, log_dir: str = "./agent_logs"):
    self.log_dir = log_dir
    self.screenshot_dir = os.path.join(log_dir, "screenshots")
    os.makedirs(self.screenshot_dir, exist_ok=True)

    self.current_session: Optional[TaskSession] = None

def start_session(self, task: str) -> str:
    """Start a new task session."""
    import uuid
    session_id = str(uuid.uuid4())[:8]

    self.current_session = TaskSession(
        session_id=session_id,
        task=task,
        start_time=datetime.now()
    )

    return session_id

def log_action(
    self,
    action_type: str,
    parameters: dict,
    success: bool,
    error: Optional[str] = None,
    screenshot_before: bytes = None,
    screenshot_after: bytes = None,
    model_reasoning: str = None,
    duration_ms: int = None
):
    """Log a single action."""
    if not self.current_session:
        raise RuntimeError("No active session")

    # Save screenshots if provided
    screenshot_paths = {}
    timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")

    if screenshot_before:
        path = os.path.join(
            self.screenshot_dir,
            f"{self.current_session.session_id}_{timestamp_str}_before.png"
        )
        with open(path, "wb") as f:
            f.write(screenshot_before)
        screenshot_paths["before"] = path

    if screenshot_after:
        path = os.path.join(
            self.screenshot_dir,
            f"{self.current_session.session_id}_{timestamp_str}_after.png"
        )
        with open(path, "wb") as f:
            f.write(screenshot_after)
        screenshot_paths["after"] = path

    # Create log entry
    entry = ActionLogEntry(
        timestamp=datetime.now(),
        action_type=action_type,
        parameters=parameters,
        success=success,
        error=error,
        screenshot_before=screenshot_paths.get("before"),
        screenshot_after=screenshot_paths.get("after"),
        model_reasoning=model_reasoning,
        duration_ms=duration_ms
    )

    self.current_session.actions.append(entry)

    # Also append to running log file
    self._append_to_log(entry)

def _append_to_log(self, entry: ActionLogEntry):
    """Append entry to JSONL log file."""
    log_file = os.path.join(
        self.log_dir,
        f"session_{self.current_session.session_id}.jsonl"
    )

    with open(log_file, "a") as f:
        f.write(json.dumps(entry.to_dict()) + "\n")

def end_session(self, success: bool, result: str = None):
    """End current session."""
    if not self.current_session:
        return

    self.current_session.end_time = datetime.now()
    self.current_session.success = success
    self.current_session.final_result = result

    # Write session summary
    summary_file = os.path.join(
        self.log_dir,
        f"session_{self.current_session.session_id}_summary.json"
    )

    summary = {
        "session_id": self.current_session.session_id,
        "task": self.current_session.task,
        "start_time": self.current_session.start_time.isoformat(),
        "end_time": self.current_session.end_time.isoformat(),
        "duration_seconds": (
            self.current_session.end_time -
            self.current_session.start_time
        ).total_seconds(),
        "total_actions": len(self.current_session.actions),
        "successful_actions": sum(
            1 for a in self.current_session.actions if a.success
        ),
        "failed_actions": sum(
            1 for a in self.current_session.actions if not a.success
        ),
        "success": success,
        "final_result": result
    }

    with open(summary_file, "w") as f:
        json.dump(summary, f, indent=2)

    self.current_session = None

def get_session_replay(self, session_id: str) -> list[dict]:
    """Get all actions from a session for replay/debugging."""
    log_file = os.path.join(self.log_dir, f"session_{session_id}.jsonl")

    actions = []
    with open(log_file, "r") as f:
        for line in f:
            actions.append(json.loads(line))

    return actions

Integration with agent

class LoggedComputerUseAgent:
"""Computer use agent with comprehensive logging."""

def __init__(self, base_agent, logger: ActionLogger):
    self.agent = base_agent
    self.logger = logger

def run_task(self, task: str) -> dict:
    """Run task with full logging."""
    session_id = self.logger.start_session(task)

    try:
        result = self._run_with_logging(task)
        self.logger.end_session(
            success=result.get("success", False),
            result=result.get("result")
        )
        return result
    except Exception as e:
        self.logger.end_session(success=False, result=str(e))
        raise

def _run_with_logging(self, task: str) -> dict:
    """Internal run with action logging."""
    # This would wrap the base agent's run method
    # and log each action
    pass

Anti_patterns

  • Not sanitizing sensitive data in logs
  • Storing screenshots indefinitely (storage costs)
  • Not rotating log files
  • Logging synchronously (blocks agent)

Sharp Edges

Web Content Can Hijack Your Agent

Severity: CRITICAL

Situation: Computer use agent browsing the web

Symptoms:
Agent suddenly performs unexpected actions. Clicks malicious links.
Enters credentials on phishing sites. Downloads files it shouldn't.
Ignores your instructions and follows embedded commands instead.

Why this breaks:
"While all agents that process untrusted content are subject to prompt
injection risks, browser use amplifies this risk in two ways. First,
the attack surface is vast: every webpage, embedded document, advertisement,
and dynamically loaded script represents a potential vector for malicious
instructions. Second, browser agents can take many different actions-
navigating to URLs, filling forms, clicking buttons, downloading files-
that attackers can exploit."

Real attacks have already happened:

  • "Microsoft Copilot agents were hijacked with emails containing malicious
    instructions, which allowed attackers to extract entire CRM databases."
  • "Google's Workspace services were manipulated-hidden prompts inside
    calendar invites and emails tricked Gemini agents into deleting events
    and exposing sensitive messages."

Even a 1% attack success rate represents meaningful risk at scale.

Recommended fix:

Defense in depth - no single solution works

  1. Sandboxing (most effective):

    # Docker with strict isolation
    docker run \
        --security-opt no-new-privileges \
        --cap-drop ALL \
        --network none \  # No internet!
        --read-only \
        computer-use-agent
    
  2. Classifier-based detection:

    def scan_for_injection(content: str) -> bool:
        """Detect prompt injection attempts."""
        patterns = [
            r"ignore.*instructions",
            r"disregard.*previous",
            r"new.*instructions",
            r"you are now",
            r"act as if",
            r"pretend to be",
        ]
        return any(re.search(p, content.lower()) for p in patterns)
    
    # Check page content before processing
    page_text = await page.text_content("body")
    if scan_for_injection(page_text):
        return {"error": "Potential injection detected"}
    
  3. User confirmation for sensitive actions:

    SENSITIVE_ACTIONS = {"download", "submit", "login", "purchase"}
    
    if action_type in SENSITIVE_ACTIONS:
        if not await get_user_confirmation(action):
            return {"error": "User rejected action"}
    
  4. Scoped credentials:

    • Never give agent access to all credentials
    • Use temporary, limited tokens
    • Revoke after task completion

Vision Agents Click Exact Centers

Severity: MEDIUM

Situation: Agent clicking on UI elements

Symptoms:
Agent's clicks are detectable as non-human. Websites may block or
CAPTCHA the agent. Anti-bot systems flag the interaction.

Why this breaks:
"When a vision model identifies a button, it calculates the center.
Click coordinates land at mathematically precise positions-often exact
element centers or grid-aligned pixel values. Humans don't click centers;
their click distributions follow a Gaussian pattern around targets."

The screenshot loop also creates detectable patterns:
"Predictable pauses. Vision agents are completely still during their
'thinking' phase. The pattern looks like: Action → Complete stillness
(1-5 seconds) → Action → Complete stillness → Action."

Sophisticated anti-bot systems detect:

  • Perfect center clicks
  • No mouse movement during "thinking"
  • Consistent timing between actions
  • Lack of micro-movements and hesitation

Recommended fix:

Add human-like variance to actions

import random
import time

def humanized_click(x: int, y: int) -> tuple[int, int]:
    """Add human-like variance to click coordinates."""
    # Gaussian distribution around target
    # Humans typically land within ~10px of target
    x_offset = int(random.gauss(0, 5))
    y_offset = int(random.gauss(0, 5))

    return (x + x_offset, y + y_offset)

def humanized_delay():
    """Add human-like delay between actions."""
    # Humans have variable reaction times
    base_delay = random.uniform(0.3, 0.8)
    # Occasionally longer pauses (reading, thinking)
    if random.random() < 0.2:
        base_delay += random.uniform(0.5, 2.0)
    time.sleep(base_delay)

def humanized_movement(from_pos: tuple, to_pos: tuple):
    """Move mouse in curved path like human."""
    # Bezier curve or similar
    # Humans don't move in straight lines
    steps = random.randint(10, 20)
    for i in range(steps):
        t = i / steps
        # Simple curve approximation
        x = from_pos[0] + (to_pos[0] - from_pos[0]) * t
        y = from_pos[1] + (to_pos[1] - from_pos[1]) * t
        # Add wobble
        x += random.gauss(0, 2)
        y += random.gauss(0, 2)
        pyautogui.moveTo(int(x), int(y))
        time.sleep(0.01)

Rotate user agents and fingerprints

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/...",
    # ... more realistic agents
]

await page.set_extra_http_headers({
    "User-Agent": random.choice(USER_AGENTS)
})

Dropdowns, Scrollbars, and Drags Are Unreliable

Severity: HIGH

Situation: Agent interacting with complex UI elements

Symptoms:
Agent fails to select dropdown options. Scroll doesn't work as expected.
Drag and drop completely fails. Hover menus disappear before clicking.

Why this breaks:
"Computer Use currently struggles with certain interface interactions,
particularly scrolling, dragging, and zooming operations. Some UI elements
(like dropdowns and scrollbars)

Discussion

Questions & comments · 0

Sign In Sign in to leave a comment.