SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse and aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research on model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark built on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's difficulty and underscore the need for more reliable automated evaluation methods.
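
As a rough illustration of the SciArena-Eval protocol described above, the minimal Python sketch below scores a model judge by its agreement rate with human votes on pairwise comparisons. All names here (`PairwiseExample`, `judge_fn`) are hypothetical conveniences for the example, not part of the released benchmark or its codebase.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class PairwiseExample:
    """One meta-evaluation item: a question, two candidate answers,
    and the human preference ("A", "B", or "tie")."""
    question: str
    answer_a: str
    answer_b: str
    human_vote: str

def judge_accuracy(
    examples: Iterable[PairwiseExample],
    judge_fn: Callable[[str, str, str], str],
) -> float:
    """Fraction of items where the judge's pairwise choice matches the human vote.

    `judge_fn` maps (question, answer_a, answer_b) to "A", "B", or "tie";
    in practice it would wrap a prompted LLM judge.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(
        judge_fn(ex.question, ex.answer_a, ex.answer_b) == ex.human_vote
        for ex in examples
    )
    return correct / len(examples)
```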