
Measuring Agents in Production

Central Theme

This paper presents the first large-scale systematic study of AI agents operating in real-world production environments. Through a survey of 306 practitioners and 20 in-depth case studies across 26 industries, the authors investigate why organizations build agents, the technical strategies used to deploy them, and the specific challenges that persist in production versus research environments.

Key Findings

  • Motivation and Use Cases: The primary driver for deploying agents is increasing productivity (73%) by automating routine tasks. Unlike fully autonomous systems, 93% of production agents serve human users directly, functioning as tools to augment human workflows rather than replace them. High latency (minutes) is often tolerated because agents are still faster than the human processes they replace.
  • Development Techniques: Simplicity and controllability dominate production architectures.
    • Models: 70% of teams rely on prompting off-the-shelf frontier models (e.g., GPT-4, Claude) rather than fine-tuning weights.
    • Prompting: 79% rely on manual prompt engineering rather than automated optimization.
    • Autonomy: Agents are highly constrained; 68% execute fewer than 10 steps before requiring human intervention.
    • Frameworks: Although survey respondents report using frameworks such as LangChain, 85% of the successful production case studies rely on custom in-house implementations to avoid dependency bloat and maintain control.
  • Evaluation: Human-in-the-loop (HITL) remains the gold standard for evaluation (74%). Automated benchmarks are difficult to create for domain-specific tasks. While 52% use LLM-as-a-judge, it is almost always combined with human verification to ensure correctness.
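The bounded-autonomy pattern described above can be sketched in a few lines: the agent loop enforces a hard step budget and escalates to a human when the budget runs out or the model proposes something the system cannot handle. This is an illustrative sketch only; `call_model`, the tool registry, and the action format are hypothetical stand-ins, not an API from the paper.

```python
# Sketch of a step-limited agent loop with human escalation.
# `call_model` is a hypothetical callable: (task, trace) -> action dict.

MAX_STEPS = 10  # mirrors the <10-step budget most surveyed teams enforce

def run_agent(task, call_model, tools, max_steps=MAX_STEPS):
    """Run a constrained agent loop; return (status, trace)."""
    trace = []
    for _ in range(max_steps):
        action = call_model(task, trace)          # model proposes next action
        if action["type"] == "finish":
            return "done", trace                  # task completed within budget
        tool = tools.get(action.get("tool"))
        if tool is None:
            return "needs_human", trace           # unknown tool -> escalate
        trace.append((action["tool"], tool(action["input"])))
    return "needs_human", trace                   # budget exhausted -> human takes over
```

The key design choice is that exhausting the budget is not an error path: "needs_human" is an expected, first-class outcome, which is how heavy human oversight stays cheap to wire in.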

Top Challenges

Reliability is the primary bottleneck. Ensuring correctness and evaluating non-deterministic outputs are the most significant hurdles. Latency and security are secondary concerns that are currently managed through architectural constraints (e.g., sandboxed environments, read-only permissions) and asynchronous processing.
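The read-only-permissions constraint mentioned above can be enforced mechanically: wrap every tool so that only an allow-listed set of read verbs can reach the real system. The verb convention and tool names below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a read-only permission guard for agent tools.
# Assumes (illustratively) that tool names start with a verb, e.g. "get_ticket".

READ_ONLY_OPS = {"get", "list", "search"}

def guarded(op_name, fn):
    """Wrap a tool so that non-read operations are refused before execution."""
    def wrapper(*args, **kwargs):
        verb = op_name.split("_", 1)[0]
        if verb not in READ_ONLY_OPS:
            raise PermissionError(f"{op_name} is not permitted in read-only mode")
        return fn(*args, **kwargs)  # read operation: pass through to the real tool
    return wrapper
```

For example, `guarded("get_ticket", fetch)` passes calls through, while `guarded("delete_ticket", destroy)` raises before the underlying function ever runs, so a misbehaving agent cannot mutate state.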

Conclusion

There is a distinct gap between research ambitions and production reality. Production agents succeed not by achieving high autonomy, but by prioritizing reliability through constraints. Practitioners deliberately trade capability for control, using simple workflows and heavy human oversight to deliver value. The study suggests that the path to more capable agentic AI runs through solving reliability first, allowing autonomy to be expanded gradually.

Mentoring question

Given that 85% of successful production teams build custom implementations rather than using popular agent frameworks, are you introducing unnecessary complexity and dependencies into your agent architecture that might hinder control and reliability?

Source: https://arxiv.org/html/2512.04123v1

