How CUAs Work: Screenshots & Actions
Understand the technical architecture behind computer use agents — vision models, action spaces, and state management.
The Vision-Action Loop
At the technical core of every CUA is a multimodal AI model that:
- Receives a screenshot (raw pixels) as input
- Processes it with computer vision to understand UI elements
- Reasons about what action to take next
- Returns a structured action command
- Executes the action and captures a new screenshot
The model must simultaneously understand visual layout, UI conventions, task context, and action history.
The Action Space
CUAs work with a defined set of primitive actions:
| Action | Description | |--------|------------| | click(x, y) | Click at screen coordinates | | type(text) | Type text at current cursor position | | key(key) | Press a keyboard key or combination | | scroll(x, y, direction) | Scroll at coordinates | | screenshot() | Capture current screen state | | move(x, y) | Move mouse without clicking |
Complex tasks are composed from sequences of these primitives.
Coordinate Systems
The agent operates in screen coordinates. It must accurately identify where UI elements are and click the right location. Errors in coordinate calculation cause wrong clicks, which compound as the task progresses.
High-DPI displays add complexity: the agent must account for device pixel ratios.
State Management
Multi-step tasks require the agent to track:
- What has been done so far
- What still needs to happen
- Current context (which application, which window, which step)
- What to do if something goes wrong
Most CUAs maintain this in the conversation context, passing the history of screenshots and actions to the model on each step.
Error Recovery
Good CUAs handle failure gracefully:
Expected: Login form appears after clicking "Sign In" Actual: Error modal appeared
Recovery:
- Read the error message
- Dismiss the modal
- Diagnose the cause (wrong credentials? session expired?)
- Attempt recovery or report failure
Error recovery is one of the hardest parts of building reliable CUAs.