BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects in diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a scalable human-in-the-loop framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose a process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.