MathPile-Commercial: Mathematical Reasoning Pre-training Corpus (Commercial Version)
Date
a year ago
MathPile-Commercial is the commercial version of MathPile. It was obtained by removing documents that prohibit commercial use from MathPile (the latest version, v0.2). Specifically, the research team checked the source data for non-commercial-use restrictions, using the license information in the metadata for arXiv documents and keyword matching for the other sources.
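The two-path check described above (license metadata for arXiv, keyword matching elsewhere) can be sketched roughly as follows. This is a minimal illustration, not the research team's actual filter: the field names `source`, `license`, and `text`, as well as the keyword patterns, are assumptions made for this example.

```python
import re

# Hypothetical patterns for non-commercial restrictions; the keyword list
# actually used to build MathPile-Commercial is not given in this description.
NONCOMMERCIAL_PATTERNS = [
    r"non[-\s]?commercial",
    r"\bby-nc\b",
    r"not for commercial use",
]
_NC_RE = re.compile("|".join(NONCOMMERCIAL_PATTERNS), re.IGNORECASE)


def allows_commercial_use(doc: dict) -> bool:
    """Return True if no non-commercial restriction is detected in `doc`.

    Assumed (hypothetical) document fields: `source`, `license`, `text`.
    """
    if doc.get("source") == "arxiv":
        # arXiv path: inspect the license string stored in the metadata.
        license_str = doc.get("license") or ""
        return _NC_RE.search(license_str) is None
    # Other sources: fall back to keyword matching on the document text.
    return _NC_RE.search(doc.get("text") or "") is None


def filter_commercial(docs):
    """Keep only documents that pass the commercial-use check."""
    return [d for d in docs if allows_commercial_use(d)]
```

Keyword matching of this kind is a heuristic: it can miss restrictions phrased in unexpected ways and can flag documents that merely mention the term, so a real pipeline would likely need manual spot checks.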
MathPile is a diverse, high-quality, math-centric corpus containing approximately 9.5 billion tokens. It differs from previous datasets in the following ways:
- Mathematics-centered: MathPile focuses on mathematics, unlike corpora that cover general domains, such as the Pile and RedPajama, or multiple languages, such as ROOTS and The Stack. Existing mathematics-centric corpora are either closed source, such as Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
- Diversity: MathPile draws on a wide range of sources: textbooks (including lecture notes), arXiv, Wikipedia, ProofWiki, StackExchange, and web pages. It contains mathematics content suitable for K-12, college, graduate level, and mathematics competitions. In particular, the research team released a large collection of high-quality textbooks (about 0.19B tokens).
- High quality: The research team follows the principle that less is more and holds that data quality matters more than quantity, even in the pre-training stage. Its data collection and processing pipeline includes a comprehensive suite of preprocessing, prefiltering, cleaning, filtering, and deduplication steps to ensure the quality of the corpus.
- Data Documentation: To enhance transparency, the research team has extensively documented MathPile. This includes a dataset sheet (see Table 5 in the paper) and quality annotations for web-sourced documents, such as language identification scores and symbol-to-word ratios. These annotations give users the flexibility to tailor the data to their needs.
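
As an illustration of tailoring the data with the quality annotations, the sketch below keeps or drops a document based on its language identification score and symbol-to-word ratio. The field names `lang_score` and `symbol_to_word_ratio` and both thresholds are hypothetical, chosen only for this example; the actual metadata keys should be checked against the released files.

```python
def keep_document(doc: dict,
                  min_lang_score: float = 0.8,
                  max_symbol_ratio: float = 0.5) -> bool:
    """Decide whether to keep `doc` based on its quality annotations.

    `lang_score` and `symbol_to_word_ratio` are assumed field names, and the
    default thresholds are illustrative rather than recommended values.
    """
    lang_score = doc.get("lang_score")
    ratio = doc.get("symbol_to_word_ratio")
    if lang_score is None or ratio is None:
        # Only web-sourced documents are described as annotated, so documents
        # without annotations are kept unchanged in this sketch.
        return True
    return lang_score >= min_lang_score and ratio <= max_symbol_ratio


def tailor_corpus(docs, **thresholds):
    """Return the subset of `docs` that passes `keep_document`."""
    return [d for d in docs if keep_document(d, **thresholds)]
```

Stricter thresholds trade corpus size for cleanliness, so it is worth inspecting the documents near each cutoff before fixing the values.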