Date

2 years ago

Size

73.61 GB

Organization

Tags

Multimodal

LLM

Mathematics

Multimodal Representation

Model Training

InfiMM-WebMath-40B was released in 2024 by a research team from ByteDance and the Chinese Academy of Sciences. The accompanying paper is titled "InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning". It is a large open-source multimodal pre-training dataset built for mathematical reasoning, containing 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all extracted and filtered from CommonCrawl (2019-2023). The release gives the open-source community a resource for improving the mathematical reasoning of multimodal large language models (MLLMs). The construction pipeline consists of text extraction, language filtering, high-quality content filtering, deduplication, and image URL extraction, steps intended to keep the data both high in quality and relevant to mathematics. For model training, the dataset is used for continued pre-training so that a model acquires mathematical knowledge in a multimodal setting, followed by instruction fine-tuning to further improve performance.
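The construction steps listed above can be illustrated with a small sketch. This is not the authors' pipeline: the names `PageParser`, `looks_english`, `looks_mathematical`, and `build_corpus` are hypothetical, the ASCII-ratio check is only a crude stand-in for real language identification, the keyword count is a stand-in for a learned quality filter, and exact hashing is a stand-in for whatever deduplication the dataset actually uses. It only shows how the five stages might fit together on raw HTML.

```python
import hashlib
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and <img src> URLs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


# Assumption: a tiny keyword list standing in for a real quality classifier.
MATH_HINTS = ("equation", "theorem", "integral", "derivative", "proof")


def looks_english(text, min_ascii_ratio=0.9):
    """Crude language filter: share of ASCII characters in the text."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= min_ascii_ratio


def looks_mathematical(text, min_hits=2):
    """Crude quality filter: at least `min_hits` distinct math keywords."""
    lowered = text.lower()
    return sum(1 for hint in MATH_HINTS if hint in lowered) >= min_hits


def build_corpus(html_pages):
    """Extract, filter, and exactly deduplicate a list of HTML strings."""
    seen = set()
    corpus = []
    for html in html_pages:
        parser = PageParser()
        parser.feed(html)
        text = " ".join(parser.text_parts)
        if not looks_english(text) or not looks_mathematical(text):
            continue
        # Exact dedup on whitespace- and case-normalized text.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        corpus.append({"text": text, "image_urls": parser.image_urls})
    return corpus
```

Each surviving record keeps the page text alongside its image URLs rather than the images themselves, which matches the way the dataset is described here as web pages plus associated image URLs.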

Citation

@misc{han2024infimmwebmath40badvancingmultimodalpretraining,
  title={InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning},
  author={Xiaotian Han and Yiren Jian and Xuefeng Hu and Haogeng Liu and Yiqi Wang and Qihang Fan and Yuang Ai and Huaibo Huang and Ran He and Zhenheng Yang and Quanzeng You},
  year={2024},
  eprint={2409.12568},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.12568}
}

InfiMM-WebMath-40B.torrent

Seeding: 1, Downloading: 0, Completed: 268, Total Downloads: 367

InfiMM-WebMath-40B/
- README.md
  1.83 KB
- README.txt
  3.67 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Related Datasets

MAKIEVAL Multilingual Cultural Knowledge Assessment Dataset

Verbatim Spans Query Condition Evidence Extraction Dataset

SAM 3D Artist Objects 3D Object Reconstruction Dataset

Nemotron-SFT-Math-v4 Mathematical Inference SFT Dataset

FigureBench Scientific Illustration Generation Benchmark Dataset

Noisy Medical Document Image Dataset

ChartNet Chart Understanding Multimodal Dataset

TACK Targeted Chimera Knowledge Base Dataset

EAVSD E-commerce Advertising Video Storyboard Dataset

SMOL Multilingual Translation Parallel Dataset

chi-bench Medical Intelligent Agent Benchmark Evaluation Dataset

ViMU Video Metaphor Understanding Dataset

MemLens Multimodal Long Context Benchmark Dataset

VisCoR-55K Visual Inference Dataset

MathNet Multimodal Mathematical Benchmark Inference Dataset

Claw-Eval Real-World Benchmark Dataset

RSRCC Remote Sensing Area Change Understanding Benchmark Dataset

OmniParsingBench Multimodal Parsing Capability Evaluation Dataset

GPT-5.4-step-by-step-reasoning Dataset

ToolACE Complex Tools Learning Dialogue Dataset
