
Desktop & Computer Use

The platform provides cloud-hosted desktop environments that agents can operate through visual perception and action -- the same way a human uses a computer. This capability evolved from a multi-provider abstraction to a focused implementation called Meeseeks (the desktop automation module), and the feature was renamed from "Workspace" to "Desktop" to avoid confusion with the organizational concept of workspaces.

Cloud Virtual Machines

Desktops run as full Linux or Windows VMs provisioned through GCP and Kubernetes. Each VM gets:

  • A dedicated display (1920x1080 default resolution)
  • VNC access proxied through the platform
  • A Robot API for programmatic interaction
  • Network isolation within the user's Kubernetes namespace

VNC Access

Users view desktops in the browser through a WebSocket proxy chain:

Browser → Platform API → guacd (in trusted cluster) → Desktop VM (in untrusted cluster)

guacd is Apache Guacamole's connection daemon. It translates between the Guacamole protocol (which the browser client speaks over WebSocket) and VNC (which the desktop VM speaks). The platform proxies the WebSocket connection from the user's browser to guacd, which then connects to the actual VM.
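To make the translation step more concrete, here is a minimal sketch of the text framing used by the Guacamole protocol: each instruction is a comma-separated list of elements, every element is prefixed with its length and a period, and the instruction ends with a semicolon. The helper names `encode_instruction` and `decode_instruction` are illustrative only and are not part of the platform or of guacd.

```python
def encode_instruction(opcode: str, *args: str) -> str:
    """Encode one Guacamole instruction as LENGTH.VALUE elements ending in ';'."""
    elements = (opcode, *args)
    return ",".join(f"{len(e)}.{e}" for e in elements) + ";"


def decode_instruction(data: str) -> list:
    """Decode a single complete instruction back into its list of elements."""
    elements = []
    pos = 0
    while True:
        dot = data.index(".", pos)          # the length prefix ends at the first period
        length = int(data[pos:dot])
        start = dot + 1
        elements.append(data[start:start + length])
        terminator = data[start + length]   # ',' continues the instruction, ';' ends it
        pos = start + length + 1
        if terminator == ";":
            return elements
        if terminator != ",":
            raise ValueError(f"unexpected terminator {terminator!r}")


if __name__ == "__main__":
    wire = encode_instruction("size", "1024", "768")
    print(wire)
    print(decode_instruction(wire))
```

Because every element carries its own length, values may contain commas or semicolons without escaping, which is why the decoder reads by length instead of splitting on delimiters.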

Info: guacd runs in the trusted cluster even though the desktop VMs run in the untrusted cluster. This is intentional -- guacd needs to be reachable from the platform API, and placing it in the trusted cluster avoids exposing VNC ports on the untrusted cluster's network.

The Meeseeks Computer-Use Loop

Meeseeks (the desktop automation module) enables agents to operate a computer visually. It uses a vision-capable model with the computer_use tool type to perceive and act on the desktop.

The core loop is conceptually simple: capture the screen, let the model choose an action, execute it, and repeat.

How It Works

  1. Screenshot: The module captures a screenshot of the desktop via the Robot API
  2. Vision analysis: The screenshot is sent to the Claude API along with the task context and the computer_use tool definitions
  3. Action decision: Claude analyzes the screenshot and returns one or more tool calls -- click(x, y), type(text), scroll(direction), key(combo), etc.
  4. Execution: Each action is translated to a Robot API call and executed on the desktop
  5. Repeat: The loop continues until the task is complete or the safety limit is reached
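The five steps above can be sketched as a small driver function. This is a sketch under stated assumptions, not the Meeseeks implementation: the three callables (`take_screenshot`, `decide_actions`, `execute_action`) and the `LoopResult` type are hypothetical stand-ins for the Robot API and model calls, and treating an empty action list as "task complete" is an assumed convention. Only the 25-iteration cap comes from this document.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_ITERATIONS = 25  # safety limit stated in this document


@dataclass
class LoopResult:
    completed: bool            # True if the model stopped requesting actions
    iterations: int            # number of screenshot/decide rounds that ran
    actions: List[dict] = field(default_factory=list)


def run_computer_use_loop(
    task: str,
    take_screenshot: Callable[[], bytes],                            # hypothetical: wraps the screenshot call
    decide_actions: Callable[[str, bytes], Optional[List[dict]]],    # hypothetical: wraps the model call
    execute_action: Callable[[dict], None],                          # hypothetical: wraps Robot API actions
    max_iterations: int = MAX_ITERATIONS,
) -> LoopResult:
    executed: List[dict] = []
    for iteration in range(1, max_iterations + 1):
        screenshot = take_screenshot()                 # step 1: screenshot
        actions = decide_actions(task, screenshot)     # steps 2-3: vision analysis and decision
        if not actions:
            # Assumed convention: no tool calls means the task is done.
            return LoopResult(True, iteration, executed)
        for action in actions:                         # step 4: execution
            execute_action(action)
            executed.append(action)
    # Step 5: the safety limit was reached before the task completed.
    return LoopResult(False, max_iterations, executed)
```

Passing the three operations in as callables keeps the loop testable without a desktop VM or model access; a caller would supply real implementations in production and fakes in tests.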

Safety Limits

The loop runs for a maximum of 25 iterations per task. This prevents runaway agents from clicking endlessly if they get stuck in a loop (for example, repeatedly clicking a button that opens a dialog that covers the button, leading to an infinite open/close cycle).

Warning: The 25-iteration limit is a safety guardrail, not a performance target. Most well-defined tasks complete in 5-10 iterations. If an agent consistently hits the limit, the task description likely needs to be more specific or broken into smaller steps.

Robot API

The Robot API exposes REST endpoints on each desktop VM for programmatic interaction:

Endpoint           Action
GET /screenshot    Capture current screen as PNG
POST /click        Click at coordinates (x, y)
POST /type         Type text string
POST /scroll       Scroll in a direction
POST /key          Press key combination
POST /drag         Drag from point A to point B

Actions are mapped directly to the Claude computer_use tool outputs. When Claude returns a computer_use tool call with action: "click", coordinate: [500, 300], the Meeseeks module translates this to POST /click {"x": 500, "y": 300} on the Robot API.
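The translation described above might look roughly like the following. Only the click mapping (`coordinate: [500, 300]` becoming `{"x": 500, "y": 300}`) is taken from this document; the function name `to_robot_request` and the payload field names for `type`, `key`, and `scroll` are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def to_robot_request(tool_input: Dict[str, Any]) -> Tuple[str, str, Dict[str, Any]]:
    """Translate one computer_use tool call into (method, path, JSON body)."""
    action = tool_input["action"]
    if action == "click":
        # Mapping stated in the document: coordinate [x, y] -> {"x": ..., "y": ...}
        x, y = tool_input["coordinate"]
        return "POST", "/click", {"x": x, "y": y}
    if action == "type":
        # Assumed payload shape.
        return "POST", "/type", {"text": tool_input["text"]}
    if action == "key":
        # Assumed payload shape; "combo" is a hypothetical field name.
        return "POST", "/key", {"combo": tool_input["combo"]}
    if action == "scroll":
        # Assumed payload shape; "direction" is a hypothetical field name.
        return "POST", "/scroll", {"direction": tool_input["direction"]}
    raise ValueError(f"unsupported action: {action!r}")


if __name__ == "__main__":
    print(to_robot_request({"action": "click", "coordinate": [500, 300]}))
```

Rejecting unknown actions with an error, instead of silently ignoring them, makes a mismatch between the model's tool output and the Robot API visible immediately.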

Architecture Overview

The human user and the agent interact with the same desktop VM but through different channels. The user sees the desktop through VNC (and can watch the agent work in real time), while the agent operates through the Robot API. This means you can observe an agent navigating a browser, filling out forms, or operating desktop applications -- it looks exactly like watching someone use a remote desktop.

Robot API Access Control

The Robot API endpoints are only accessible from within the agent's Kubernetes namespace. Network policies prevent external access to the Robot API ports, so only agents deployed in the same namespace as the desktop VM can control it. This ensures that one user's agents cannot interact with another user's desktops.
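As an illustration of the kind of network policy this implies, the sketch below builds a Kubernetes NetworkPolicy manifest as a Python dictionary. It is not the platform's actual policy: the policy name, the `app: desktop-vm` label, and the port number are all hypothetical. An empty `podSelector` inside `from` matches every pod in the policy's own namespace, which is what restricts callers to the same namespace.

```python
import json


def robot_api_network_policy(namespace: str, robot_port: int) -> dict:
    """Build a NetworkPolicy allowing Robot API ingress only from the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "robot-api-same-namespace-only",  # hypothetical name
            "namespace": namespace,
        },
        "spec": {
            # Hypothetical label identifying the desktop workloads.
            "podSelector": {"matchLabels": {"app": "desktop-vm"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # An empty podSelector selects all pods in this namespace only.
                    "from": [{"podSelector": {}}],
                    "ports": [{"protocol": "TCP", "port": robot_port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Port 8080 is an assumed value for illustration.
    print(json.dumps(robot_api_network_policy("user-1234", 8080), indent=2))
```

Because a NetworkPolicy that selects a pod denies all ingress it does not explicitly allow, a real policy would also need a rule admitting the VNC traffic from guacd.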

Evolution from Multi-Provider to Meeseeks

The original design attempted to abstract over multiple computer-use providers. This was abandoned in favor of the Meeseeks approach because:

  1. Provider APIs diverged significantly -- abstracting over them produced a lowest-common-denominator interface that limited capabilities
  2. The computer_use tool type provided a well-defined, capable interface that covered the platform's needs
  3. Debugging was simpler with a single provider -- when something went wrong, there was one place to look

The rename from "Workspace" to "Desktop" happened because users confused desktop VMs with the organizational concept of workspaces (which group projects). The new naming makes the distinction clear: a workspace is where you organize work, a desktop is where an agent (or you) operates a computer.

See Also