Quesma Launches OTelBench: Benchmark Reveals Frontier LLMs Fail on Real-World SRE Tasks with Just 29% Pass Rate
Quesma, Inc. has unveiled OTelBench, the first independent benchmark designed to evaluate large language models (LLMs) on real-world OpenTelemetry instrumentation tasks, revealing a stark gap between AI's coding prowess and its ability to perform essential Site Reliability Engineering (SRE) work in production environments.

The results show that even the most advanced frontier models struggle significantly, with the best-performing model, Claude Opus 4.5, achieving only a 29% pass rate on the benchmark. This finding underscores a critical disconnect: while LLMs excel at generating code in controlled settings, they falter when faced with the complexity and precision required in real-world observability systems. For context, the same model scored 80.9% on SWE-Bench, a benchmark focused on software engineering tasks, highlighting that coding ability alone does not translate to production-grade reliability engineering.

The benchmark evaluates models across multiple programming languages and instrumentation scenarios, including context propagation, which is essential for distributed tracing. Alarmingly, most models failed entirely on this core task, which is fundamental to understanding how requests flow across microservices. (A minimal sketch of what such instrumentation involves appears at the end of this release.) This failure is particularly concerning given that enterprise outages cost an average of $1.4 million per hour, making robust observability a mission-critical priority.

Despite the challenges, some progress was observed. Models showed moderate success with Go and, unexpectedly, C++. Limited success was also recorded in JavaScript, PHP, .NET, and Python. However, only one model managed to complete a single task in Rust, and none succeeded in Swift, Ruby, or Java, languages widely used in enterprise environments.

Jacek Migdał, founder of Quesma, emphasized the implications: “The backbone of the software industry relies on complex, high-scale production systems where reliability is non-negotiable. OTelBench demonstrates that even at small scale, today’s LLMs are not yet capable of handling fundamental instrumentation tasks or end-to-end problem-solving required in real SRE workflows. Many vendors are promoting AI-powered SRE tools with bold claims, but without independent validation.”

Migdał likened the current state of AI SRE to DevOps anomaly detection in 2016: promising but largely unproven. “That’s why we’re releasing OTelBench as open-source: to establish a North Star for the community, enabling transparent, reproducible evaluation and tracking real progress beyond hype.”

OTelBench is now available at https://quesma.com/benchmarks/otel/.

Quesma provides independent evaluation and advanced simulation environments for frontier LLMs and AI agent developers, focusing on critical domains such as DevOps, security, and database migrations. The company is backed by Heartcore Capital, Inovo, Firestreak Ventures, and angel investors including Christian Beedgen, co-founder of Sumo Logic. For more information, visit www.quesma.com or follow Quesma on LinkedIn.
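To make concrete the kind of task on which most models failed, below is a minimal, illustrative sketch of context propagation with the OpenTelemetry Go SDK. It is not drawn from the benchmark itself: the service names, route, and downstream URL are hypothetical, and tracer provider and exporter setup is omitted for brevity. The handler extracts the caller's trace context from the incoming headers, starts a child span, and injects the context into the outgoing request so the downstream service joins the same trace rather than starting a new one.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// handleCheckout shows the two halves of context propagation: extracting the
// incoming trace context from the request headers, starting a child span, and
// injecting the context into the outgoing request so the downstream service
// continues the same distributed trace.
func handleCheckout(w http.ResponseWriter, r *http.Request) {
	propagator := otel.GetTextMapPropagator()

	// Extract the parent trace context (W3C traceparent header) sent by the caller.
	ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

	ctx, span := otel.Tracer("checkout-service").Start(ctx, "handleCheckout")
	defer span.End()

	// Build the downstream call and inject the current context into its headers
	// so the (hypothetical) payment service joins the same trace.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://payment/charge", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		span.RecordError(err)
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
}

func main() {
	// Use the W3C Trace Context propagator; exporter configuration is omitted here.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	http.HandleFunc("/checkout", handleCheckout)
	_ = http.ListenAndServe(":8080", nil)
}
```

Getting both the extract and inject steps right, in every service and language along the request path, is what allows a trace to follow a request across microservices; OTelBench reports that most models failed this category entirely.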