HyperAI

OpenMathInstruct-2 Math Instruction Tuning Dataset

OpenMathInstruct-2 is a large-scale open-source math instruction-tuning dataset released by NVIDIA in 2024 to accelerate progress in AI for mathematics. The accompanying paper is "OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data". The dataset contains 14 million question-answer pairs (about 600,000 unique questions), making it nearly 8 times larger than the previous largest dataset of its kind. A Llama-3.1-8B-Base model fine-tuned on OpenMathInstruct-2 scores 67.8% on the MATH benchmark, 15.9 percentage points higher than Llama-3.1-8B-Instruct (51.9%).
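
The headline figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in the description; the variable names are illustrative. It shows that the 15.9 figure is an absolute difference in accuracy (percentage points), and that the dataset averages roughly 23 solutions per unique question.

```python
# Figures quoted in the dataset description
total_pairs = 14_000_000       # question-answer pairs
unique_questions = 600_000     # approximate number of unique questions

baseline_accuracy = 51.9       # Llama-3.1-8B-Instruct on MATH, in percent
finetuned_accuracy = 67.8      # fine-tuned Llama-3.1-8B-Base on MATH, in percent

# Average number of solutions per unique question
solutions_per_question = total_pairs / unique_questions

# Absolute gain (percentage points) versus relative gain (percent of baseline)
absolute_gain = finetuned_accuracy - baseline_accuracy
relative_gain = absolute_gain / baseline_accuracy * 100

print(f"solutions per question: {solutions_per_question:.1f}")
print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```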

The OpenMathInstruct-2 dataset contains the following fields:

  • problem: The original problem, taken from the GSM8K or MATH training set, or an augmented problem derived from those training sets.
  • generated_solution: The synthetically generated solution.
  • expected_answer: For problems taken from the training sets, the ground-truth answer provided in the source dataset. For augmented problems, the answer obtained by majority vote.
  • problem_source: Indicates whether the problem comes directly from GSM8K or MATH, or is an augmented version derived from either dataset.

Example of dataset structure
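
The four fields listed above can be sketched as a small record type, together with a helper that illustrates the majority-vote rule described for expected_answer on augmented problems. This is an illustrative sketch and not NVIDIA's actual pipeline: the Record class, the majority_answer function, the sample answers, and the problem_source placeholder value are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    """One row, using the four field names from the dataset description."""
    problem: str
    generated_solution: str
    expected_answer: str
    problem_source: str


def majority_answer(candidate_answers):
    """Return the most frequent answer among the candidates.

    Ties go to the answer seen first; that tie-break is an assumption
    of this sketch, not something the dataset description specifies.
    """
    if not candidate_answers:
        raise ValueError("need at least one candidate answer")
    return Counter(candidate_answers).most_common(1)[0][0]


# Hypothetical final answers extracted from several generated solutions
sampled_answers = ["12", "12", "15", "12", "9"]

row = Record(
    problem="(made-up augmented problem text)",
    generated_solution="(made-up solution text)",
    expected_answer=majority_answer(sampled_answers),
    problem_source="augmented_example",  # placeholder, not a real dataset value
)
print(row.expected_answer)  # prints 12
```

For problems taken directly from the GSM8K or MATH training sets no vote is needed, since expected_answer is the reference answer from the source dataset.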

OpenMathInstruct-2.torrent
  • OpenMathInstruct-2/
    • README.md (1.85 KB)
    • README.txt (3.7 KB)
    • data/
      • OpenMathInstruct-2.zip (10.23 GB)