
The Leaderboard Illusion

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
Published: 5/6/2025
Abstract

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
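To illustrate the selective-disclosure effect the abstract describes, here is a minimal Monte Carlo sketch (not taken from the paper) showing how publishing only the best of N privately tested variants inflates the reported score even when every variant has the same true quality. The variant count mirrors the 27 private Llama-4 variants cited above; the number of battles per variant and the use of a raw win rate instead of a full Bradley-Terry fit are simplifying assumptions for illustration only.

```python
# Sketch: best-of-N selective disclosure bias under a fixed true win rate.
# Assumptions (hypothetical, not from the paper): each private variant is
# evaluated on N_BATTLES head-to-head battles, and only the variant with the
# highest observed win rate is published.
import numpy as np

rng = np.random.default_rng(0)

TRUE_WIN_RATE = 0.5    # every variant is truly average
N_VARIANTS = 27        # number of private variants tested before release
N_BATTLES = 500        # battles per private variant (assumed)
N_TRIALS = 10_000      # Monte Carlo repetitions

# Observed win rate of each variant is a binomial sample around the true rate.
wins = rng.binomial(N_BATTLES, TRUE_WIN_RATE, size=(N_TRIALS, N_VARIANTS))
observed = wins / N_BATTLES

# The provider discloses only its best-scoring variant in each trial.
published = observed.max(axis=1)

print(f"true win rate:            {TRUE_WIN_RATE:.3f}")
print(f"mean published win rate:  {published.mean():.3f}")
print(f"inflation from selection: {published.mean() - TRUE_WIN_RATE:+.3f}")
```

Under these assumptions the published win rate sits noticeably above the true value, purely because the maximum of many noisy estimates is reported; the gap widens as more private variants are tested or as each variant gets fewer battles.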