Desktop & Computer Use
The platform provides cloud-hosted desktop environments that agents can operate through visual perception and action -- the same way a human uses a computer. This capability evolved from a multi-provider abstraction to a focused implementation called Meeseeks (the desktop automation module), and the feature was renamed from "Workspace" to "Desktop" to avoid confusion with the organizational concept of workspaces.
Cloud Virtual Machines
Desktops run as full Linux or Windows VMs provisioned through GCP and Kubernetes. Each VM gets:
- A dedicated display (1920x1080 default resolution)
- VNC access proxied through the platform
- A Robot API for programmatic interaction
- Network isolation within the user's Kubernetes namespace
VNC Access
Users view desktops in the browser through a WebSocket proxy chain:
Browser → Platform API → guacd (in trusted cluster) → Desktop VM (in untrusted cluster)
guacd is Apache Guacamole's connection daemon. It translates between the Guacamole protocol (which the browser client speaks over WebSocket) and VNC (which the desktop VM speaks). The platform proxies the WebSocket connection from the user's browser to guacd, which then connects to the actual VM.
guacd runs in the trusted cluster even though the desktop VMs run in the untrusted cluster. This is intentional -- guacd needs to be reachable from the platform API, and placing it in the trusted cluster avoids exposing VNC ports on the untrusted cluster's network.
The Meeseeks Computer-Use Loop
Meeseeks (the desktop automation module) enables agents to operate a computer visually. It uses a vision-capable model with the computer_use tool type to perceive and act on the desktop.
The core loop is conceptually simple: capture a screenshot, ask the model what to do, execute the returned actions, and repeat.
How It Works
- Screenshot: The module captures a screenshot of the desktop via the Robot API
- Vision analysis: The screenshot is sent to the Claude API along with the task context and the computer_use tool definitions
- Action decision: Claude analyzes the screenshot and returns one or more tool calls -- click(x, y), type(text), scroll(direction), key(combo), etc.
- Execution: Each action is translated to a Robot API call and executed on the desktop
- Repeat: The loop continues until the task is complete or the safety limit is reached
Safety Limits
The loop runs for a maximum of 25 iterations per task. This prevents runaway agents from clicking endlessly if they get stuck in a loop (for example, repeatedly clicking a button that opens a dialog that covers the button, leading to an infinite open/close cycle).
The 25-iteration limit is a safety guardrail, not a performance target. Most well-defined tasks complete in 5-10 iterations. If an agent consistently hits the limit, the task description likely needs to be more specific or broken into smaller steps.
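The steps under How It Works, together with the iteration cap, can be sketched roughly as follows. This is a minimal illustration and not the Meeseeks implementation: the function name, the `LoopResult` type, the three injected callables, and the convention that an empty list of tool calls means "task complete" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

MAX_ITERATIONS = 25  # safety limit described in the text above


@dataclass
class LoopResult:
    completed: bool   # True if the model stopped requesting actions
    iterations: int   # number of screenshot/decide rounds used


def run_computer_use_loop(
    task: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes], Sequence[Any]],
    execute: Callable[[Any], None],
    max_iterations: int = MAX_ITERATIONS,
) -> LoopResult:
    """Screenshot -> vision analysis -> action decision -> execution -> repeat."""
    for iteration in range(1, max_iterations + 1):
        screenshot = capture()               # step 1: screenshot
        actions = decide(task, screenshot)   # steps 2-3: model returns tool calls
        if not actions:
            # Assumption: no tool calls means the model considers the task done.
            return LoopResult(completed=True, iterations=iteration)
        for action in actions:
            execute(action)                  # step 4: run each action on the desktop
    # Safety limit reached without the task finishing.
    return LoopResult(completed=False, iterations=max_iterations)


# Stand-in callables so the sketch runs without a desktop or a model.
# This "agent" clicks the same spot forever, so the guardrail has to stop it.
executed = []
stuck = run_computer_use_loop(
    "close the dialog",
    capture=lambda: b"fake-png",
    decide=lambda task, shot: [{"action": "click", "coordinate": [500, 300]}],
    execute=executed.append,
)
```

Passing the capture, decide, and execute steps in as callables keeps the guardrail logic separate from the Robot API and the model call, which makes the cap easy to test in isolation.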
Robot API
The Robot API exposes REST endpoints on each desktop VM for programmatic interaction:
| Endpoint | Action |
|---|---|
| GET /screenshot | Capture current screen as PNG |
| POST /click | Click at coordinates (x, y) |
| POST /type | Type text string |
| POST /scroll | Scroll in a direction |
| POST /key | Press key combination |
| POST /drag | Drag from point A to point B |
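As a rough illustration of the table above, the sketch below builds (but does not send) HTTP requests for these endpoints with Python's standard library. Only the methods and paths come from the table; the base URL, the port, the helper name `build_robot_request`, and the use of JSON request bodies are assumptions for the example.

```python
import json
import urllib.request

# Methods and paths taken from the endpoint table above.
ROBOT_ENDPOINTS = {
    "screenshot": ("GET", "/screenshot"),
    "click": ("POST", "/click"),
    "type": ("POST", "/type"),
    "scroll": ("POST", "/scroll"),
    "key": ("POST", "/key"),
    "drag": ("POST", "/drag"),
}


def build_robot_request(base_url, name, payload=None):
    """Build an unsent urllib request for one Robot API endpoint."""
    if name not in ROBOT_ENDPOINTS:
        raise ValueError(f"unknown Robot API endpoint: {name}")
    method, path = ROBOT_ENDPOINTS[name]
    data = None
    headers = {}
    if method == "POST":
        # Assumption: POST bodies are JSON objects.
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


# Hypothetical in-namespace address; the real host and port are not documented here.
request = build_robot_request("http://desktop-vm:8080", "click", {"x": 500, "y": 300})
print(request.get_method(), request.full_url)
```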
Robot API actions map directly onto the outputs of Claude's computer_use tool. When Claude returns a computer_use tool call with action: "click" and coordinate: [500, 300], the Meeseeks module translates it to POST /click {"x": 500, "y": 300} on the Robot API.
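The translation step can be sketched as a small pure function. Only the click mapping is confirmed by the example above; the function name `translate_tool_call` and the handling of the other actions (forwarding their remaining fields unchanged as the request body) are assumptions for illustration.

```python
def translate_tool_call(tool_input):
    """Map a computer_use tool input to (method, path, body) for the Robot API."""
    action = tool_input["action"]
    if action == "click":
        # Confirmed by the text: coordinate [x, y] becomes {"x": x, "y": y}.
        x, y = tool_input["coordinate"]
        return ("POST", "/click", {"x": x, "y": y})
    if action in {"type", "scroll", "key", "drag"}:
        # Assumption: remaining fields are forwarded unchanged as the body.
        body = {k: v for k, v in tool_input.items() if k != "action"}
        return ("POST", f"/{action}", body)
    raise ValueError(f"unsupported action: {action!r}")


print(translate_tool_call({"action": "click", "coordinate": [500, 300]}))
# → ('POST', '/click', {'x': 500, 'y': 300})
```

Rejecting unknown actions with an error, instead of silently ignoring them, makes a mismatch between the model's tool output and the Robot API visible immediately.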
Architecture Overview
The human user and the agent interact with the same desktop VM but through different channels. The user sees the desktop through VNC (and can watch the agent work in real time), while the agent operates through the Robot API. This means you can observe an agent navigating a browser, filling out forms, or operating desktop applications -- it looks exactly like watching someone use a remote desktop.
Robot API Access Control
The Robot API endpoints are accessible only from within the user's Kubernetes namespace, where the desktop VM and that user's agents both run. Network policies block external access to the Robot API ports, so only agents deployed in the same namespace as the desktop VM can control it. This ensures that one user's agents cannot interact with another user's desktops.
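One way such a restriction could be expressed is a Kubernetes NetworkPolicy that admits ingress on the Robot API port only from pods in the same namespace. The sketch below builds that manifest as a Python dictionary and prints it as JSON, which kubectl also accepts. The policy name, the `app: desktop` label, and port 8080 are hypothetical; the platform's actual policies are not shown in this document. A real policy would also need a separate rule admitting VNC traffic from guacd, which this sketch omits.

```python
import json


def robot_api_network_policy(namespace, robot_port=8080):
    """Build a NetworkPolicy manifest limiting Robot API ingress to one namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "robot-api-same-namespace", "namespace": namespace},
        "spec": {
            # Hypothetical label identifying desktop VM pods.
            "podSelector": {"matchLabels": {"app": "desktop"}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # An empty podSelector with no namespaceSelector matches
                    # every pod in the policy's own namespace, and nothing else.
                    "from": [{"podSelector": {}}],
                    # Hypothetical Robot API port.
                    "ports": [{"protocol": "TCP", "port": robot_port}],
                }
            ],
        },
    }


policy = robot_api_network_policy("user-1234")
print(json.dumps(policy, indent=2))
```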
Evolution from Multi-Provider to Meeseeks
The original design attempted to abstract over multiple computer-use providers. This was abandoned in favor of the Meeseeks approach because:
- Provider APIs diverged significantly -- abstracting over them produced a lowest-common-denominator interface that limited capabilities
- The computer_use tool type provided a well-defined, capable interface that covered the platform's needs
- Debugging was simpler with a single provider -- when something went wrong, there was one place to look
The rename from "Workspace" to "Desktop" happened because users confused desktop VMs with the organizational concept of workspaces (which group projects). The new naming makes the distinction clear: a workspace is where you organize work, a desktop is where an agent (or you) operates a computer.
See Also
- Provision a Cloud Desktop -- step-by-step guide to creating a cloud desktop
- Connect via VNC -- how to view and interact with a desktop in your browser
- Use Computer Use with an Agent -- configuring an agent to automate a desktop
- Desktop Automation Tutorial -- end-to-end walkthrough of desktop automation