MathPile Mathematical Reasoning Pre-trained Corpus

Date: 2 years ago

Organization: Shanghai Jiao Tong University

Paper URL: arxiv.org

License: Other

MathPile is a diverse, high-quality, math-centric corpus containing approximately 9.5 billion tokens. It differs from earlier datasets in the following ways:

  • Mathematics-centered: MathPile is built specifically for mathematics, unlike corpora that target general domains, such as the Pile and RedPajama, multilingual text, such as ROOTS, or programming languages, such as The Stack. Mathematics-centric corpora do exist, but they are either closed source, such as those behind Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
  • Diversity: MathPile draws on a wide range of sources: textbooks (including lecture notes), arXiv, Wikipedia, ProofWiki, StackExchange, and web pages. It covers mathematics at the K-12, college, and graduate levels, as well as mathematics competitions. In particular, the research team released a large collection of high-quality textbooks (about 0.19B tokens).
  • High quality: The research team follows the principle that less is more and holds that data quality matters more than quantity, even in the pre-training stage. Their data collection and processing pipeline includes pre-processing, pre-filtering, cleaning, filtering, and de-duplication steps designed to keep the corpus high in quality.
  • Data documentation: To improve transparency, the research team has documented MathPile extensively. This includes a dataset sheet (see Table 5 in the paper) and quality annotations for the documents from web sources, such as language identification scores and symbol-to-word ratios. These annotations let users tailor the data to their own needs. The research team also performed data contamination detection to remove samples that duplicate items from benchmark test sets such as MATH and MMLU-STEM.
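
The quality annotations mentioned above can be used to filter the corpus. The following is a minimal sketch, not MathPile's actual tooling: the record fields `text` and `lang_score`, the symbol set, and both thresholds are hypothetical and chosen only for illustration. The real annotation field names and the exact definition of the symbol-to-word ratio should be taken from the paper and the released files.

```python
from typing import Dict, Iterable, List

# Hypothetical symbol set; the definition used for MathPile's
# annotations may differ.
SYMBOLS = ("#", "...", "…")


def symbol_to_word_ratio(text: str) -> float:
    """Return occurrences of SYMBOLS divided by whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    symbol_count = sum(text.count(symbol) for symbol in SYMBOLS)
    return symbol_count / len(words)


def filter_documents(
    docs: Iterable[Dict],
    min_lang_score: float = 0.8,    # assumed threshold, for illustration
    max_symbol_ratio: float = 0.1,  # assumed threshold, for illustration
) -> List[Dict]:
    """Keep documents that pass both quality checks.

    Each document is assumed to be a dict with a "text" field and a
    "lang_score" field (a language identification confidence in [0, 1]).
    """
    kept = []
    for doc in docs:
        lang_ok = doc.get("lang_score", 0.0) >= min_lang_score
        ratio_ok = symbol_to_word_ratio(doc["text"]) <= max_symbol_ratio
        if lang_ok and ratio_ok:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        {"text": "Let x be a real number.", "lang_score": 0.95},
        {"text": "# # # ... junk", "lang_score": 0.90},
        {"text": "Sei x eine reelle Zahl.", "lang_score": 0.20},
    ]
    for doc in filter_documents(sample):
        print(doc["text"])
```

Raising `min_lang_score` or lowering `max_symbol_ratio` gives a smaller, stricter subset; loosening them keeps more of the web-sourced data.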
