Loading learning content…
Loading learning content…
Understand the technical architecture behind computer use agents — vision models, action spaces, and state management.
Read through the lesson, mark it complete when the concept is clear, then move to the next lesson in the sequence or jump back to the module map.
At the technical core of every CUA is a multimodal AI model that:
The model must simultaneously understand visual layout, UI conventions, task context, and action history.
CUAs work with a defined set of primitive actions:
| Action | Description |
|---|---|
| click(x, y) | Click at screen coordinates |
| type(text) | Type text at current cursor position |
| key(key) | Press a keyboard key or combination |
| scroll(x, y, direction) | Scroll at coordinates |
| screenshot() | Capture current screen state |
| move(x, y) | Move mouse without clicking |
Complex tasks are composed from sequences of these primitives.
The agent operates in screen coordinates. It must accurately identify where UI elements are and click the right location. Errors in coordinate calculation cause wrong clicks, which compound as the task progresses.
High-DPI displays add complexity: the agent must account for device pixel ratios.
Multi-step tasks require the agent to track:
Most CUAs maintain this in the conversation context, passing the history of screenshots and actions to the model on each step.
Good CUAs handle failure gracefully:
Expected: Login form appears after clicking "Sign In" Actual: Error modal appeared
Recovery:
Error recovery is one of the hardest parts of building reliable CUAs.