InfiMM-WebMath-40B Multimodal Mathematical Reasoning Dataset
The InfiMM-WebMath-40B dataset was released by a research team from ByteDance and the Chinese Academy of Sciences in 2024. The related paper is titled "InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning".
InfiMM-WebMath-40B is a large open-source multimodal dataset built specifically for mathematical reasoning. It contains 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all extracted and filtered from CommonCrawl snapshots (2019-2023). It gives the open-source community a resource for improving the mathematical reasoning capabilities of multimodal large language models (MLLMs).
The dataset construction pipeline consists of text extraction, language filtering, high-quality content filtering, deduplication, and image URL extraction. Together, these steps keep the data clean and relevant to mathematics. For model training, InfiMM-WebMath-40B is used for continued pre-training, so that the model acquires mathematical knowledge in a multimodal setting, and instruction fine-tuning is then applied to further improve performance.
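The construction steps listed above can be illustrated with a toy pipeline. This is only a minimal sketch using the Python standard library, not the authors' implementation: the function names (`extract`, `is_mostly_ascii_letters`, `looks_mathematical`, `build_corpus`), the thresholds, and the keyword list are all hypothetical stand-ins. In particular, the language check, the math-keyword heuristic, and the exact-hash deduplication are simple placeholders for whatever classifiers and deduplication methods a real pipeline would use.

```python
import hashlib
import re
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects visible text and <img src> URLs in document order."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Steps 1 and 5: text extraction and image URL extraction."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.image_urls


def is_mostly_ascii_letters(text, threshold=0.9):
    """Step 2 (toy language filter): keep text whose letters are mostly ASCII."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    return sum(c.isascii() for c in letters) / len(letters) >= threshold


# Hypothetical keyword/symbol list; a real pipeline would use a trained classifier.
_MATH_PATTERN = re.compile(
    r"\b(?:equation|theorem|proof|integral|derivative)\b|[=+^]", re.IGNORECASE
)


def looks_mathematical(text, min_hits=2):
    """Step 3 (toy quality filter): require a few math keywords or symbols."""
    return len(_MATH_PATTERN.findall(text)) >= min_hits


def _fingerprint(text):
    """Hash of case- and whitespace-normalized text, used for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_corpus(pages):
    """Run all steps over (url, html) pairs and return the surviving records."""
    seen = set()
    kept = []
    for url, html in pages:
        text, image_urls = extract(html)
        if not is_mostly_ascii_letters(text):
            continue
        if not looks_mathematical(text):
            continue
        digest = _fingerprint(text)  # Step 4: deduplication
        if digest in seen:
            continue
        seen.add(digest)
        kept.append({"url": url, "text": text, "image_urls": image_urls})
    return kept
```

Calling `build_corpus` with a list of `(url, html)` pairs returns one record per surviving page, each keeping the page text together with its image URLs. Storing URLs rather than image bytes mirrors the dataset's description as web pages plus associated image URLs.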