Learning CenterManus & Computer Use AgentsHow CUAs Work: Screenshots & Actions
Intermediate7 min read

How CUAs Work: Screenshots & Actions

Understand the technical architecture behind computer use agents — vision models, action spaces, and state management.

The Vision-Action Loop

At the technical core of every CUA is a multimodal AI model that:

  1. Receives a screenshot (raw pixels) as input
  2. Processes it with computer vision to understand UI elements
  3. Reasons about what action to take next
  4. Returns a structured action command
  5. Executes the action and captures a new screenshot

The model must simultaneously understand visual layout, UI conventions, task context, and action history.

The Action Space

CUAs work with a defined set of primitive actions:

| Action | Description | |--------|------------| | click(x, y) | Click at screen coordinates | | type(text) | Type text at current cursor position | | key(key) | Press a keyboard key or combination | | scroll(x, y, direction) | Scroll at coordinates | | screenshot() | Capture current screen state | | move(x, y) | Move mouse without clicking |

Complex tasks are composed from sequences of these primitives.

Coordinate Systems

The agent operates in screen coordinates. It must accurately identify where UI elements are and click the right location. Errors in coordinate calculation cause wrong clicks, which compound as the task progresses.

High-DPI displays add complexity: the agent must account for device pixel ratios.

State Management

Multi-step tasks require the agent to track:

  • What has been done so far
  • What still needs to happen
  • Current context (which application, which window, which step)
  • What to do if something goes wrong

Most CUAs maintain this in the conversation context, passing the history of screenshots and actions to the model on each step.

Error Recovery

Good CUAs handle failure gracefully:

Expected: Login form appears after clicking "Sign In" Actual: Error modal appeared

Recovery:

  1. Read the error message
  2. Dismiss the modal
  3. Diagnose the cause (wrong credentials? session expired?)
  4. Attempt recovery or report failure

Error recovery is one of the hardest parts of building reliable CUAs.

Loading…