ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu

Release Date: 5/28/2025

Abstract

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
