
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
Abstract

This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.