AI evaluations emerge as new compute bottleneck
AI evaluation has evolved from a negligible cost into a primary bottleneck, reshaping who can afford to do research and benchmarking. Recent data from the Holistic Agent Leaderboard (HAL) shows that running 21,730 agent rollouts across nine models and benchmarks cost approximately $40,000, a sharp break from the era when training was the dominant expense and evaluation was an afterthought. For context, a single run of the GAIA benchmark on a frontier model can exceed $2,800 without caching, and a sweep across agent configurations by Exgentic found a thirty-three-fold cost gap for identical tasks.

The cost crisis cuts across evaluation categories. In scientific machine learning, evaluating a single new architecture on The Well benchmark consumes roughly 960 H100-hours, or about $2,400, and a full baseline sweep reaches $9,600. MLE-Bench requires approximately 1,800 GPU-hours for a single seed across its 75 competitions, totaling around $100,000 once API costs for multiple models are included. In some domains, credible evaluation now costs more than training the model itself.

Static benchmarks such as HELM faced similar cost spikes years ago, but they benefited from compression techniques like Flash-HELM and tinyBenchmarks, which cut compute requirements by up to 200 times without losing ranking fidelity. Those methods struggle with modern agent and training-in-the-loop benchmarks: agent evaluations involve multi-turn interactions in which scaffold choices and token budgets drive the cost, making them resistant to simple subsampling.

Reliability is another major expense. A single accuracy measurement carries little statistical power; a defensible result typically requires multiple runs, which can multiply the initial cost by eight or more (the sketches at the end of this piece walk through this and related arithmetic). A statistically robust evaluation on the HAL benchmark, for instance, could escalate from $40,000 to $320,000.

The financial barrier creates a divide between well-funded industry labs and smaller academic groups. Independent evaluation of frontier models is becoming prohibitively expensive for researchers without large budgets: a single PaperBench evaluation can cost nearly $10,000, and a comprehensive comparison across six models and three seeds can exceed $150,000. As a result, the ability to validate AI systems is concentrating inside the very organizations that build them, leaving less room for external oversight.

Current leaderboards often make the problem worse by reporting raw accuracy without accounting for cost, which rewards wasteful spending; some benchmarks even show that additional inference compute can reduce accuracy. To mitigate this, the community is pushing for standardized data sharing. Projects like the EvalEval Coalition's Every Eval Ever initiative aim to archive detailed evaluation traces so that researchers can reuse existing runs rather than pay full price to repeat them. Sharing complete grading logs and tool-call histories could save more money than any compression technique.

As evaluation becomes the new compute constraint, the field must prioritize reliability over single-run metrics. Without a shift toward cost-conscious benchmarking and open data sharing, the ability to scientifically assess AI capabilities will remain limited to those who can afford the entry fee. The economics have fundamentally changed: whoever controls the evaluation budget now defines the leaderboard.
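To ground a few of the claims above, the sketches that follow work through the arithmetic under stated, simplified assumptions. First, the compression claim for static benchmarks rests on a check that is easy to state: does the model ranking computed on a small subset of items match the ranking on the full benchmark? The Python sketch below runs that check on synthetic item-level scores using Spearman rank correlation. Flash-HELM and tinyBenchmarks use more careful item selection than uniform subsampling, so this only illustrates the shape of the fidelity test, not those methods.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)

    # Synthetic static benchmark: 10 models scored on 100,000 items, one 0/1 outcome per item.
    n_models, n_items = 10, 100_000
    true_skill = np.linspace(0.35, 0.75, n_models)                  # each model's "true" accuracy
    scores = rng.random((n_models, n_items)) < true_skill[:, None]  # Bernoulli item outcomes

    full_acc = scores.mean(axis=1)

    # "Compress" the benchmark: score models on a random 0.5% of items (a 200x cost reduction).
    subset = rng.choice(n_items, size=n_items // 200, replace=False)
    small_acc = scores[:, subset].mean(axis=1)

    rho, _ = spearmanr(full_acc, small_acc)
    print(f"Spearman rank correlation, full vs. 200x-subsampled ranking: {rho:.2f}")

The reason the same trick buys less for agent benchmarks is that their run-to-run variance comes from scaffolds, tool calls, and sampling randomness rather than from the number of items alone, so dropping items does not shrink the dominant source of noise.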
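Second, the reliability multiplier. The sketch below is a minimal back-of-the-envelope model, assuming independent Bernoulli task outcomes on a hypothetical 165-task benchmark, of how the 95% confidence interval on measured accuracy narrows as runs are repeated.

    import math

    def ci_halfwidth(p: float, n_tasks: int, n_runs: int, z: float = 1.96) -> float:
        """Approximate 95% CI half-width for mean accuracy: each task outcome is treated
        as an independent Bernoulli(p) draw, and results from n_runs runs are averaged."""
        se_one_run = math.sqrt(p * (1 - p) / n_tasks)  # standard error of a single run's accuracy
        return z * se_one_run / math.sqrt(n_runs)      # averaging runs shrinks it by sqrt(n_runs)

    # Hypothetical agent benchmark: 165 tasks, true success rate around 40%.
    for runs in (1, 2, 4, 8):
        hw = ci_halfwidth(p=0.40, n_tasks=165, n_runs=runs)
        print(f"{runs} run(s): accuracy pinned down to about ±{hw * 100:.1f} points")

Under these toy assumptions, a single run leaves roughly a ±7-point uncertainty band, and narrowing it to ±2-3 points takes about eight runs, which is where an eightfold multiplier, and a $40,000 evaluation growing to $320,000, comes from.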
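Third, cost-aware reporting. One lightweight alternative to raw-accuracy leaderboards is to publish the accuracy-cost Pareto frontier. The sketch below uses made-up entries (the names and numbers are illustrative, not drawn from any real leaderboard) to show how dominated configurations fall out.

    # Hypothetical (name, accuracy, cost-per-run) leaderboard entries; numbers are illustrative only.
    entries = [
        ("agent-a", 0.62, 310.0),
        ("agent-b", 0.58, 45.0),
        ("agent-c", 0.61, 980.0),
        ("agent-d", 0.41, 12.0),
    ]

    def pareto_frontier(rows):
        """Keep only entries that are not beaten on accuracy by anything cheaper."""
        frontier, best_acc = [], -1.0
        for name, acc, cost in sorted(rows, key=lambda r: r[2]):  # cheapest first
            if acc > best_acc:
                frontier.append((name, acc, cost))
                best_acc = acc
        return frontier

    for name, acc, cost in pareto_frontier(entries):
        print(f"{name}: {acc:.0%} accuracy at ${cost:,.0f} per run")

In this toy table, agent-c spends the most per run and still drops off the frontier, the same pattern as benchmarks where extra inference compute fails to buy extra accuracy.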
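Finally, trace sharing is, at bottom, a caching argument: if a rollout for a given model, scaffold, task, and seed has already been run and graded, nobody should pay to produce it again. The sketch below shows one way a shared archive could be keyed by a content hash of the run configuration; the field names and on-disk layout are assumptions for illustration, not the format used by HAL or the Every Eval Ever initiative.

    import hashlib
    import json
    from pathlib import Path

    CACHE_DIR = Path("eval_trace_cache")  # illustrative local stand-in for a shared archive

    def trace_key(model: str, scaffold: str, task_id: str, seed: int) -> str:
        """Deterministic key for one rollout: identical configs map to the same trace."""
        config = {"model": model, "scaffold": scaffold, "task_id": task_id, "seed": seed}
        return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

    def get_or_run(model, scaffold, task_id, seed, run_fn):
        """Return an archived trace if one exists; otherwise run the rollout and archive it."""
        path = CACHE_DIR / f"{trace_key(model, scaffold, task_id, seed)}.json"
        if path.exists():
            return json.loads(path.read_text())        # reuse: zero marginal cost
        trace = run_fn(model, scaffold, task_id, seed)  # expensive: tokens, tool calls, grading
        CACHE_DIR.mkdir(exist_ok=True)
        path.write_text(json.dumps(trace))
        return trace

    if __name__ == "__main__":
        # Stub rollout standing in for a real agent harness.
        def fake_rollout(model, scaffold, task_id, seed):
            return {"model": model, "task_id": task_id, "score": 1.0, "tool_calls": [], "cost_usd": 0.42}

        first = get_or_run("model-x", "react", "task-001", 0, fake_rollout)   # pays for the run
        second = get_or_run("model-x", "react", "task-001", 0, fake_rollout)  # served from the archive
        print(first == second)  # True

A shared, content-addressed archive along these lines is what turns "reuse existing runs rather than pay full price to repeat them" from a slogan into a workflow.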
