How CUAs Work: Screenshots & Actions

The Vision-Action Loop

At the technical core of every CUA is a multimodal AI model that:

Receives a screenshot (raw pixels) as input
Processes it with computer vision to understand UI elements
Reasons about what action to take next
Returns a structured action command
Executes the action and captures a new screenshot

The model must simultaneously understand visual layout, UI conventions, task context, and action history.

The Action Space

CUAs work with a defined set of primitive actions:

Action	Description
click(x, y)	Click at screen coordinates
type(text)	Type text at current cursor position
key(key)	Press a keyboard key or combination
scroll(x, y, direction)	Scroll at coordinates
screenshot()	Capture current screen state
move(x, y)	Move mouse without clicking

Complex tasks are composed from sequences of these primitives.

Coordinate Systems

The agent operates in screen coordinates. It must accurately identify where UI elements are and click the right location. Errors in coordinate calculation cause wrong clicks, which compound as the task progresses.

High-DPI displays add complexity: the agent must account for device pixel ratios.

State Management

Multi-step tasks require the agent to track:

What has been done so far
What still needs to happen
Current context (which application, which window, which step)
What to do if something goes wrong

Most CUAs maintain this in the conversation context, passing the history of screenshots and actions to the model on each step.

Error Recovery

Good CUAs handle failure gracefully:

Expected: Login form appears after clicking "Sign In" Actual: Error modal appeared

Recovery:

Read the error message
Dismiss the modal
Diagnose the cause (wrong credentials? session expired?)
Attempt recovery or report failure

Error recovery is one of the hardest parts of building reliable CUAs.

The Vision-Action Loop

The Action Space

Coordinate Systems

State Management

Error Recovery

Manus & Computer Use Agents