OpenMathInstruct-2 Math Instruction Tuning Dataset
Date: 7 months ago
Size: 10.23 GB
OpenMathInstruct-2 is a large-scale open-source math instruction-tuning dataset released by NVIDIA in 2024 to accelerate progress in AI for mathematics. The accompanying paper is "OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data". The dataset contains 14 million question-answer pairs (about 600,000 unique questions), making it nearly 8 times larger than the previous largest dataset of its kind. Fine-tuning Llama-3.1-8B-Base on OpenMathInstruct-2 yields a model that outperforms Llama-3.1-8B-Instruct on the MATH benchmark by 15.9 percentage points (from 51.9% to 67.8%).
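The headline figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the description; the variable names are illustrative. It separates the absolute gain (in percentage points) from the relative gain, and estimates how many solutions exist per unique question on average.

```python
# Figures quoted in the description above.
baseline_acc = 51.9         # Llama-3.1-8B-Instruct on MATH, in percent
finetuned_acc = 67.8        # Llama-3.1-8B-Base fine-tuned on OpenMathInstruct-2, in percent
total_pairs = 14_000_000    # question-answer pairs
unique_questions = 600_000  # approximate number of unique questions

# Absolute gain is a difference of percentages, so it is in percentage points.
absolute_gain = finetuned_acc - baseline_acc
# Relative gain expresses the same difference as a fraction of the baseline.
relative_gain = absolute_gain / baseline_acc * 100
# Average number of solutions per unique question.
solutions_per_question = total_pairs / unique_questions

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}% over the baseline")
print(f"average solutions per unique question: {solutions_per_question:.1f}")
```

The 15.9 figure is therefore an absolute difference; relative to the 51.9% baseline it corresponds to roughly a 30.6% improvement, and each unique question has about 23 solutions on average.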
The OpenMathInstruct-2 dataset contains the following fields:
- problem: The original problem, taken from the GSM8K or MATH training set, or a problem augmented from those training sets.
- generated_solution: The synthetically generated solution.
- expected_answer: For problems from the original training sets, the ground-truth answer provided by the source dataset. For augmented problems, the answer obtained by majority vote.
- problem_source: Indicates whether the problem comes directly from GSM8K or MATH, or is an augmented version derived from one of them.
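The schema above can be illustrated with a minimal sketch that validates records and summarizes them by source. The sample records are invented for illustration, and the `problem_source` labels used here (`gsm8k`, `math`, `augmented_math`) are assumptions about how the sources are spelled, not values confirmed by this page. The helper names `validate` and `summarize` are hypothetical as well.

```python
from collections import Counter

# The four fields described in the list above.
REQUIRED_FIELDS = ("problem", "generated_solution", "expected_answer", "problem_source")

# Hypothetical records shaped like the schema; contents and source labels are made up.
sample_records = [
    {"problem": "What is 2 + 3?", "generated_solution": "2 + 3 = 5.",
     "expected_answer": "5", "problem_source": "gsm8k"},
    {"problem": "What is 2 + 3?", "generated_solution": "Start at 2 and count up 3 to reach 5.",
     "expected_answer": "5", "problem_source": "gsm8k"},
    {"problem": "Solve x^2 = 9 for x > 0.", "generated_solution": "x = sqrt(9) = 3.",
     "expected_answer": "3", "problem_source": "math"},
    {"problem": "Solve x^2 = 16 for x > 0.", "generated_solution": "x = sqrt(16) = 4.",
     "expected_answer": "4", "problem_source": "augmented_math"},
]


def validate(record):
    """Raise ValueError if a record lacks any of the four expected fields."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")


def summarize(records):
    """Return (count of records per problem_source, number of unique problems)."""
    per_source = Counter()
    unique_problems = set()
    for record in records:
        validate(record)
        per_source[record["problem_source"]] += 1
        unique_problems.add(record["problem"])
    return per_source, len(unique_problems)


per_source, n_unique = summarize(sample_records)
print(dict(per_source))
print(f"{len(sample_records)} records, {n_unique} unique problems")
```

Because one problem can carry several generated solutions, the number of records exceeds the number of unique problems, which mirrors the 14 million pairs versus roughly 600,000 unique questions reported for the full dataset.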
