
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Publication date: 4/23/2025
Abstract

As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose corresponding guiding principles for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.
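
The alignment analysis summarized above amounts to correlating per-model benchmark scores with per-model human ratings. The sketch below illustrates the idea using Spearman rank correlation from scipy; the abstract does not specify the correlation measure or data format, so the model scores, ratings, and the choice of spearmanr are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch: correlating benchmark scores with human judgments.
# All numbers below are hypothetical placeholders, and Spearman correlation
# is an assumed choice; the paper's exact measure and data are not shown here.
from scipy.stats import spearmanr

# Hypothetical per-model scores on a multilingual benchmark (e.g., accuracy).
benchmark_scores = [0.82, 0.75, 0.68, 0.61, 0.54]

# Hypothetical human-judgment ratings for the same models, same order.
human_ratings = [0.79, 0.77, 0.60, 0.65, 0.50]

# Rank correlation is robust to the two sources using different score scales.
rho, p_value = spearmanr(benchmark_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")

# Reading the result: values like 0.70-0.85 (as reported for STEM tasks)
# indicate the benchmark tracks human judgment well; values like 0.11-0.30
# (as reported for XQuAD-style QA) indicate it does not.
```

In this framing, the paper's localization finding corresponds to localized benchmarks yielding a higher correlation with local human judgments (0.68) than translated English benchmarks do (0.47).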