Observability for agent fleets: what to track, how to alert, and how to debug.
Traditional software observability focuses on latency, error rates, and throughput. Agent systems add a new dimension: quality. An agent can execute successfully (no errors, normal latency) while producing completely wrong results.
You need metrics for both operational health and output quality.
Operational:
Quality:
Every agent invocation should produce a trace: a timeline of every tool call, model call, and decision, with inputs and outputs at each step.
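Use a tracing tool (Langfuse, LangSmith, or a custom one) to capture these traces. Good traces can make debugging dramatically faster, because you can inspect the exact inputs and outputs at the step that went wrong.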
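To make the idea of a trace concrete, here is a minimal sketch of a custom recorder in plain Python. It is not the Langfuse or LangSmith API. The names `Step`, `Trace`, and `record` are hypothetical, and the lambdas stand in for real tool and model calls.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Step:
    kind: str          # "tool_call", "model_call", or "decision"
    name: str          # which tool, model, or decision point ran
    inputs: Any
    outputs: Any       # result, or {"error": ...} if the step raised
    started_at: float  # wall-clock timestamp (seconds since epoch)
    duration_s: float  # elapsed time for this step


@dataclass
class Trace:
    invocation_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, inputs: Any,
               fn: Callable[[Any], Any]) -> Any:
        """Run fn(inputs), append a Step to the timeline, return the result."""
        started_at = time.time()
        t0 = time.perf_counter()
        outputs: Any = None
        try:
            outputs = fn(inputs)
        except Exception as exc:
            # Keep failed steps in the timeline, then re-raise.
            outputs = {"error": repr(exc)}
            raise
        finally:
            self.steps.append(
                Step(kind, name, inputs, outputs, started_at,
                     time.perf_counter() - t0)
            )
        return outputs


# Usage with stand-in functions (assumptions, not real tools):
trace = Trace(invocation_id="demo-1")
query = trace.record("decision", "choose_query", "weather in Paris",
                     lambda text: text.upper())
result = trace.record("tool_call", "fake_search", query,
                      lambda q: {"hits": [q]})

for step in trace.steps:
    print(step.kind, step.name, step.inputs, "->", step.outputs)
```

Recording failed steps as well as successful ones means the timeline still shows where an invocation stopped and what input it stopped on.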
| Metric | Warning | Critical |
|---|---|---|
| Error rate | >2% | >10% |
| Latency p95 | >5s | >30s |
| Cost per task | >2x baseline | >5x baseline |
| Completion rate | <95% | <80% |
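The thresholds in the table can be encoded as a small lookup. This is a sketch under stated assumptions: the metric keys, the `Threshold` class, and the `severity` function are hypothetical names. Rates are written as fractions and cost as a ratio to baseline. The cost row is assumed to fire when the ratio strictly exceeds the listed multiple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics where low values are bad


# Values taken from the table above.
THRESHOLDS = {
    "error_rate": Threshold(0.02, 0.10),
    "latency_p95_s": Threshold(5.0, 30.0),
    "cost_ratio": Threshold(2.0, 5.0),  # cost per task / baseline cost
    "completion_rate": Threshold(0.95, 0.80, higher_is_worse=False),
}


def severity(metric: str, value: float) -> str:
    """Return "critical", "warning", or "ok" for one metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "ok"


# Hypothetical readings for one evaluation window.
readings = {
    "error_rate": 0.03,
    "latency_p95_s": 4.0,
    "cost_ratio": 5.5,
    "completion_rate": 0.90,
}
alerts = {name: severity(name, value) for name, value in readings.items()}
print(alerts)
```

Checking the critical bound first ensures that a reading which crosses both thresholds is reported once, at the higher severity.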
When an agent fails:
Never debug agent failures by rerunning the full system repeatedly: it is too slow and too noisy.
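One alternative that follows from the tracing advice, although the lesson does not spell it out, is to replay a single recorded step in isolation using the inputs captured in the trace. The sketch below assumes a step stored as a plain dictionary. `replay_step`, `parse_amount`, and the recorded values are hypothetical illustrations.

```python
from typing import Any, Callable, Dict


def replay_step(step: Dict[str, Any], fn: Callable[[Any], Any]) -> Dict[str, Any]:
    """Re-run one recorded step with its recorded inputs and compare outputs."""
    actual = fn(step["inputs"])
    return {
        "name": step["name"],
        "matches_recording": actual == step["outputs"],
        "recorded": step["outputs"],
        "actual": actual,
    }


# A step copied out of a trace (made-up values): the recorded output is wrong.
recorded_step = {
    "name": "parse_amount",
    "inputs": "1,200 USD",
    "outputs": 1.0,
}


def parse_amount(text: str) -> float:
    """Candidate fix: strip thousands separators before converting."""
    return float(text.split()[0].replace(",", ""))


report = replay_step(recorded_step, parse_amount)
print(report)
```

Because only one step runs, each iteration takes milliseconds and involves no other model calls, which removes most of the noise that comes with a full-system rerun.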