MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Run-Ze Fan, Zengzhi Wang, Pengfei Liu
Abstract

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
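
For readers who want to try the released data, the sketch below shows one way to load it with the Hugging Face `datasets` library. The repository IDs `MegaScience/TextbookReasoning` and `MegaScience/MegaScience` are assumptions inferred from the project name, not stated in the abstract; check the official release page for the actual paths.

```python
# Minimal sketch: loading the released datasets with the Hugging Face
# `datasets` library. The repository IDs below are assumptions based on
# the project name; verify them against the official release.
from datasets import load_dataset

# TextbookReasoning: ~650k questions with reference answers extracted
# from 12k university-level textbooks across 7 disciplines.
textbook = load_dataset("MegaScience/TextbookReasoning", split="train")

# MegaScience: the ~1.25M-instance mixture of curated open-source datasets.
megascience = load_dataset("MegaScience/MegaScience", split="train")

print(textbook[0])          # inspect one question/answer record
print(len(megascience))     # total number of training instances
```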
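The abstract also credits accurate evaluation metrics to robust answer extraction. As a rough illustration only, not the paper's actual implementation, the sketch below shows what a rule-based extractor can look like: it tries a few common answer formats in priority order and falls back to the last non-empty line. Every pattern and the function name are hypothetical.

```python
import re

def extract_answer(response: str) -> str | None:
    """Illustrative answer extraction: try several patterns in priority
    order, then fall back to the last non-empty line. The patterns here
    are assumptions; the paper's evaluation system defines its own
    strategies."""
    patterns = [
        r"\\boxed\{([^{}]+)\}",                             # LaTeX \boxed{...}
        r"answer is\s*\(?([A-D])\)?[\s.]*$",                # multiple-choice letter
        r"(?:final answer|the answer is)[:\s]*([^\n.]+)",   # common phrasings
    ]
    for pat in patterns:
        matches = re.findall(pat, response, flags=re.IGNORECASE)
        if matches:
            return matches[-1].strip()  # prefer the last occurrence
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else None

print(extract_answer("Reasoning...\nThe answer is (B)."))  # -> "B"
```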