[Mathematical Dataset Summary] Terence Tao Recommends Datasets! Including Code, Chinese Competition Questions, Forward and Reverse Question Answering, and More

Last week, the renowned mathematician Terence Tao shared a resource list titled "AI for Math Resources" on his personal blog, aiming to help those interested in entering the field of AI for mathematics. The list was compiled by the "AI-Assisted Mathematical Reasoning" seminar, which was organized by the U.S. National Academies of Sciences, Engineering, and Medicine, with Terence Tao serving as host.
The list has not yet been finalized, and Tao and other researchers are still updating it. HyperAI has selected some of its datasets for everyone to download and use. In addition, we have compiled other mathematical datasets to support AI for Math research.
1.OpenWebMath Web Mathematics Dataset
Publishing Agency:University of Toronto, University of Cambridge, etc.
Release time:2023
Estimated size:44.21 GB
Download address:https://go.hyper.ai/erQGZ
OpenWebMath collects much of the high-quality mathematical text available on the Internet. It was filtered and extracted from more than 200B HTML files on Common Crawl, yielding 6.3 million documents with a total of 14.7B tokens.
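As a rough sense of scale, the two figures quoted above imply an average document length of a little over two thousand tokens. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the result is an average derived from the rounded figures in this article, not a statistic reported by the dataset authors.

```python
# Figures quoted in the description above (rounded).
num_documents = 6_300_000        # 6.3 million documents
num_tokens = 14_700_000_000      # 14.7B tokens

# Average tokens per document implied by those two numbers.
avg_tokens_per_doc = num_tokens / num_documents

print(f"about {avg_tokens_per_doc:.0f} tokens per document on average")
```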
2.Ape210K Chinese primary school level mathematics problems
Publishing Agency:Yuanfudao AI Lab, Northwestern University
Release time:2020
Estimated size:78.43 MB
Download address:https://go.hyper.ai/SL5to
Ape210K is a large-scale, template-rich math word problem dataset containing 210K Chinese elementary school-level math problems. Each problem comes with its answer and the equation needed to obtain that answer.
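Because each problem pairs an answer with an equation, the two can be checked against each other. The following is a minimal sketch of such a check, assuming the equation is stored as a string of the form `x=<arithmetic expression>` and the answer as a numeric string. That format, the function names, and the sample values are assumptions for illustration and are not taken from the dataset files.

```python
import ast
import operator
from fractions import Fraction

# Binary operators allowed on the right-hand side of an equation.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Recursively evaluate a parsed arithmetic expression with exact fractions."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        # Going through str() keeps decimals such as 0.5 exact (1/2).
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def equation_matches_answer(equation, answer):
    """Return True if the right-hand side of 'x=<expr>' evaluates to `answer`."""
    _, rhs = equation.split("=", 1)
    value = _eval(ast.parse(rhs.strip(), mode="eval"))
    return value == Fraction(answer)


# Hypothetical record; the field names are assumed, not the dataset's schema.
sample = {"equation": "x=(12+8)/4", "answer": "5"}
print(equation_matches_answer(sample["equation"], sample["answer"]))
```

Parsing with `ast` and evaluating only a small set of arithmetic nodes avoids calling `eval` on untrusted strings, and `Fraction` avoids floating-point mismatches when comparing results.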
3.Proof-Pile-2 Mathematical Dataset
Publishing Agency:Princeton University
Release time:2023
Estimated size:47.57 GB
Download address:https://go.hyper.ai/TXmiP
Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents, a blend of scientific papers, math-related web content, and mathematical code, up to date as of April 2023.
4.Orca-Math-200K math problem dataset
Publishing Agency:Microsoft
Release time:2024
Estimated size:70.88 MB
Download address:https://go.hyper.ai/o4pMG
Orca-Math-200K is a high-quality math problem dataset created by Microsoft, containing approximately 200,000 grade-school math word problems. All answers in this dataset were generated with Azure GPT4-Turbo.
5.Mizar Mathematical Library
Publishing Agency:Mizar
Release time:2018
Download address:https://go.hyper.ai/I8pi6
Mizar is a library of formalized mathematics written in the Mizar language, created and revised by many authors and maintainers over the years. The Mizar system has by now grown into the large Mizar Mathematical Library, which provides a solid foundation for future work on mathematics and related problems.
6.Math23K math word problem solving dataset
Publishing Agency:Tencent AI Lab
Release time:2017
Estimated size:8.36 MB
Download address:https://go.hyper.ai/2YsRR
Math23K is a dataset created for solving math word problems, containing 23,162 Chinese problems crawled from the Internet.
7.MathVista Mathematical Reasoning Dataset
Publishing Agency:Microsoft, University of Washington
Release time:2023
Estimated size:1.61 GB
Download address:https://go.hyper.ai/GHNsf
MathVista is a comprehensive benchmark for mathematical reasoning in visual contexts. It includes three newly created datasets, IQTest, FunctionQA, and PaperQA, which evaluate logical reasoning on puzzle test figures, algebraic reasoning over function plots, and scientific reasoning with academic paper figures, respectively.
8.MetaMathQA Mathematical Reasoning Dataset
Publishing Agency:Huawei, University of Cambridge
Release time:2023
Estimated size:84.34 MB
Download address:https://go.hyper.ai/Vy2iw
MetaMathQA is a broad-coverage, high-quality mathematical reasoning dataset consisting of 395K mathematical question-answer pairs, covering both forward and reverse reasoning, generated by a large language model.
9.AlgoPuzzleVQA Multimodal Algorithmic Puzzle Dataset
Publishing Agency:Singapore University of Technology and Design
Release time:2024
Estimated size:157.85 MB
Download address:https://go.hyper.ai/mmzdn
AlgoPuzzleVQA contains 18 different puzzles covering diverse mathematical and algorithmic topics such as Boolean logic, combinatorics, graph theory, optimization, and search. The puzzles are generated automatically from human-written code, so the dataset can be scaled arbitrarily in both reasoning complexity and size.
10.TAL-SCQ5K Chinese Mathematics Competition Dataset
Publishing Agency:TAL Education Group (Good Future)
Release time:2023
Estimated size:11.4 MB
Download address:https://go.hyper.ai/ZuYTB
TAL-SCQ5K is a high-quality mathematics competition dataset containing 5K competition questions (3K for training and 2K for testing), available in both Chinese and English.
The above are the 10 mathematical datasets compiled by HyperAI. If you have resources that you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit an article to let us know!
About HyperAI
HyperAI (hyper.ai) is a leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:
* Provided domestic accelerated download nodes for 1200+ public datasets
* Collected 300+ classic and popular online tutorials
* Interpreted 100+ AI4Science paper cases
* Supported search for 500+ related terms
* Hosted the first complete Chinese documentation for Apache TVM in China
Visit the official website (hyper.ai) to start your learning journey.