MathX-5M Mathematical Reasoning Dataset
MathX is a mathematical reasoning dataset designed for instruction tuning and for fine-tuning existing models to strengthen their step-by-step reasoning. The dataset is the largest and most comprehensive public corpus of mathematical reasoning data to date.
The dataset includes 5 million carefully selected step-by-step examples, each containing a problem statement, a detailed reasoning process, and a verified correct solution. The examples cover arithmetic and number theory, algebra and polynomials, geometry and trigonometry, and calculus and analysis.
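As a minimal sketch of the three-part structure described above, one example could be represented and turned into an instruction-tuning pair as follows. The field names (`problem`, `reasoning`, `solution`), the prompt wording, and the sample record are assumptions made for illustration, not the dataset's published schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MathExample:
    # Hypothetical field names; the actual column names may differ.
    problem: str    # problem statement
    reasoning: str  # detailed step-by-step reasoning
    solution: str   # verified final answer


def to_prompt_and_target(example: MathExample) -> tuple[str, str]:
    """Format one example as an (instruction, response) pair for tuning."""
    prompt = f"Solve the following problem step by step.\n\n{example.problem}"
    target = f"{example.reasoning}\n\nFinal answer: {example.solution}"
    return prompt, target


# Illustrative record written for this sketch, not taken from the dataset.
sample = MathExample(
    problem="What is 12 * 7 + 5?",
    reasoning="First compute 12 * 7 = 84. Then add 5 to get 89.",
    solution="89",
)
prompt, target = to_prompt_and_target(sample)
```

Keeping the reasoning and the final answer in separate fields makes it possible to train on the full reasoning trace while still checking the final answer on its own.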
Problem complexity distribution:
- Basic (30%): Basic mathematical concepts and operations
- Intermediate (30%): Multi-step problems requiring reasoning chains
- Advanced (40%): Complex mathematical challenges and proofs
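To make the distribution concrete, the sketch below converts the stated shares into approximate example counts. It assumes the total is exactly 5,000,000 and that the percentages are exact, which the description above does not guarantee. The names `TOTAL_EXAMPLES`, `LEVEL_SHARES`, and `level_counts` are introduced here for illustration.

```python
TOTAL_EXAMPLES = 5_000_000  # assumed exact; the source says "5 million"
LEVEL_SHARES = {"basic": 30, "intermediate": 30, "advanced": 40}  # percent


def level_counts(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Return the number of examples per level for integer percent shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {level: total * pct // 100 for level, pct in shares.items()}


counts = level_counts(TOTAL_EXAMPLES, LEVEL_SHARES)
# basic: 1,500,000; intermediate: 1,500,000; advanced: 2,000,000
```

Under these assumptions, advanced problems make up the largest single group, about 2 million examples.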
Dataset features:
- Diversity: Comprehensive coverage of mathematics from basic arithmetic to advanced calculus
- Quality: Multi-stage screening and verification process
- Reasoning: Step-by-step solutions with detailed mathematical reasoning
- Accuracy: Answers verified for correctness using reinforcement learning