HyperAI

InfiMM-WebMath-40B Multimodal Mathematical Reasoning Dataset

Date

7 months ago

Size

73.61 GB

Organization

Chinese Academy of Sciences

Publish URL

huggingface.co

The InfiMM-WebMath-40B dataset was released in 2024 by a research team from ByteDance and the Chinese Academy of Sciences. The related paper is titled "InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning".

InfiMM-WebMath-40B is a large open-source multimodal dataset built specifically for mathematical reasoning. It contains 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all extracted and filtered from CommonCrawl data (2019-2023). The release gives the open-source community a resource for improving the mathematical reasoning of multimodal large language models (MLLMs).

The dataset was constructed in five stages: text extraction, language filtering, high-quality content filtering, deduplication, and image URL extraction. Together these stages are intended to keep only pages that are relevant to mathematics and of usable quality. For model training, the dataset is used for continued pre-training, so that a model acquires mathematical knowledge in a multimodal setting, followed by instruction fine-tuning to further improve performance.
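The construction stages described above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' pipeline: the page does not say which tools were used, so the ASCII-ratio language check, the keyword-based `math_score`, the exact-hash deduplication, and every function name and threshold below are assumptions chosen to keep the example self-contained.

```python
import hashlib
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and <img src> URLs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def looks_english(text, threshold=0.9):
    # Assumption: a crude ASCII-ratio check standing in for a real
    # language-identification model.
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold


# Assumption: a handful of keyword hints standing in for a trained
# quality/math classifier.
MATH_HINTS = re.compile(
    r"(\\frac|\\sum|\\int|=|\btheorem\b|\bequation\b|\bproof\b)",
    re.IGNORECASE,
)


def math_score(text):
    words = text.split()
    if not words:
        return 0.0
    return len(MATH_HINTS.findall(text)) / len(words)


def build_corpus(html_pages, min_score=0.02):
    """Runs the five stages over an iterable of raw HTML strings."""
    seen_hashes = set()
    kept = []
    for html in html_pages:
        parser = PageParser()
        parser.feed(html)                    # text + image URL extraction
        parser.close()
        text = " ".join(parser.text_parts)
        if not looks_english(text):          # language filtering
            continue
        if math_score(text) < min_score:     # content filtering
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:            # exact deduplication
            continue
        seen_hashes.add(digest)
        kept.append({"text": text, "image_urls": parser.image_urls})
    return kept
```

Calling `build_corpus` on a list of HTML strings returns one dictionary per surviving page, with the extracted text and the image URLs found on it. A production pipeline would typically replace each stand-in with a stronger component, for example a trained language identifier and near-duplicate detection rather than exact hashing.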

InfiMM-WebMath-40B.torrent
Seeding: 1 | Downloading: 1 | Completed: 82 | Total Downloads: 83
  • InfiMM-WebMath-40B/
    • README.md
      1.83 KB
    • README.txt
      3.67 KB
    • data/
      • InfiMM-WebMath-40B.zip
        73.61 GB