PolyMath Multilingual Mathematical Reasoning Benchmark Dataset
PolyMath is a multilingual mathematical reasoning evaluation dataset released in 2025 by Alibaba's Qwen (Qianwen) team in collaboration with Shanghai Jiao Tong University. The accompanying research paper is titled "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts". The paper was accepted to the NeurIPS 2025 Datasets and Benchmarks track. It aims to systematically evaluate the mathematical understanding, reasoning depth, and cross-lingual consistency of large language models in multilingual settings.
The dataset contains 500 high-quality mathematical reasoning questions per language, 125 at each of 4 difficulty levels. Every question is provided in 18 parallel language versions, spanning both high-resource and low-resource languages that together cover more than 75% of the world's native speakers. Difficulty ranges from basic K-12 mathematics to Olympiad-level and frontier mathematics, giving a high-quality, multi-dimensional, and highly discriminative evaluation of mathematical reasoning.
Dataset distribution:
- Number and distribution of questions: Each language offers 125 questions at each difficulty level, forming a balanced difficulty composition.
- Difficulty classification criteria: Divided into four levels based on "Thought Depth" and "Knowledge Breadth":
  - Level 1: Basics (K–12)
  - Level 2: Advanced (High School to Upper Grades)
  - Level 3: High difficulty (Olympiad level)
  - Level 4: Cutting Edge (Advanced Mathematics and Research-Level Reasoning)
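The balanced composition described above can be checked with simple arithmetic. The following is a minimal sketch, not the dataset's official tooling: the record fields `language` and `level`, the placeholder language codes, and the helper names are all hypothetical, and the records are synthetic stand-ins rather than real PolyMath data.

```python
from collections import Counter

# Figures taken from the description above.
NUM_LANGUAGES = 18
LEVELS = (1, 2, 3, 4)
QUESTIONS_PER_CELL = 125  # questions per (language, level) pair


def expected_sizes():
    """Return (questions per language, questions across all languages)."""
    per_language = len(LEVELS) * QUESTIONS_PER_CELL
    return per_language, NUM_LANGUAGES * per_language


def is_balanced(records):
    """Check that every (language, level) cell holds exactly 125 records.

    `records` is an iterable of dicts with hypothetical keys
    "language" and "level"; the real dataset schema may differ.
    """
    counts = Counter((r["language"], r["level"]) for r in records)
    return (
        len(counts) == NUM_LANGUAGES * len(LEVELS)
        and all(c == QUESTIONS_PER_CELL for c in counts.values())
    )


# Synthetic placeholder records (not real questions), used only to
# exercise the check. "lang_00" etc. are made-up language codes.
records = [
    {"language": f"lang_{i:02d}", "level": level}
    for i in range(NUM_LANGUAGES)
    for level in LEVELS
    for _ in range(QUESTIONS_PER_CELL)
]

per_language, total = expected_sizes()
print(per_language, total, is_balanced(records))
```

With 125 questions in each of 4 levels, each language has 500 questions, and the 18 parallel versions give 9,000 question instances in total. Running the same check on a locally downloaded copy would only require mapping its actual field names onto the two hypothetical keys used here.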