
MMLU-Pro Large-Scale Multi-Task Understanding Dataset

Date

8 months ago

Size

3.48 MB

Publish URL

github.com

* This dataset supports online use.

MMLU-Pro is a more robust and challenging large-scale multi-task understanding dataset designed to benchmark the capabilities of large language models more rigorously. It contains 12K complex questions spanning multiple disciplines. The dataset was released in 2024 by researchers from the University of Waterloo, the University of Toronto, and Carnegie Mellon University, and is described in the paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark".
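
The page links to the dataset on GitHub; for quick experimentation, here is a minimal loading sketch assuming the TIGER-Lab/MMLU-Pro mirror on the Hugging Face Hub (the dataset id, split name, and field names below are assumptions based on that mirror, not taken from this page):

```python
# Minimal loading sketch. Assumes the TIGER-Lab/MMLU-Pro mirror on the
# Hugging Face Hub; the dataset id, split, and field names are assumptions.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

example = ds[0]
print(example["question"])   # question text
print(example["options"])    # list of up to 10 answer options
print(example["answer"])     # gold answer letter, e.g. "A"
print(example["category"])   # subject, e.g. "math"
```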

  • Questions and options: Each question typically has 10 multiple-choice options, though manual review removed unreasonable options from some questions, leaving them with fewer. The original MMLU questions had only 4 options each; expanding to 10 increases complexity and robustness, since finding the correct answer among many potential distractors requires deeper reasoning (see the sketch after this list).
  • Sources: The dataset combines questions from several sources:
    • Original MMLU questions: Part of the dataset comes from the original MMLU dataset, with trivial and ambiguous questions removed.
    • STEM websites: High-quality STEM questions carefully selected from the Internet.
    • TheoremQA: High-quality, human-annotated questions that require theorems to solve.
    • SciBench: Science questions from university-level exams.
  • Newly added subjects: The questions drawn from STEM websites, TheoremQA, and SciBench enhance biology, business, chemistry, computer science, economics, engineering, mathematics, physics, and psychology.
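
Because manual review removed some options, not every question has exactly 10 choices. A small sketch (continuing the loading example above, with the same assumed field names) that tallies the option counts:

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Tally how many options each question carries; most entries should show 10,
# with a minority reduced below 10 by the manual review described above.
option_counts = Counter(len(row["options"]) for row in ds)
print(sorted(option_counts.items()))
```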

Compared with the original MMLU, there are three main differences:

  • The original MMLU dataset contains only 4 options per question; MMLU-Pro increases this to 10. The additional options make the evaluation more realistic and challenging: random guessing now yields a much lower expected score (10% rather than 25%).
  • The original MMLU dataset mainly contains knowledge-driven questions that require little reasoning, so perplexity-based (PPL) evaluation usually scores better than chain-of-thought (CoT) prompting. By increasing question difficulty and integrating more reasoning-focused questions, MMLU-Pro reverses this: CoT can score up to 20% higher than PPL.
  • By increasing the number of distractors, MMLU-Pro significantly reduces the probability of guessing correctly by chance, making the benchmark more robust. Specifically, across tests with 24 different prompt styles, the sensitivity of model scores to prompt changes dropped from 4-5% on MMLU to 2% on MMLU-Pro.
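
To make the 10-option CoT setup concrete, here is a hedged sketch of prompt construction and answer extraction. The "The answer is (X)" convention loosely follows common MMLU-Pro evaluation scripts, but the exact prompt wording and regex here are assumptions, not the paper's official harness:

```python
import re

LETTERS = "ABCDEFGHIJ"  # up to 10 options, A through J

def build_cot_prompt(question: str, options: list[str]) -> str:
    # Format the question with lettered options and ask for step-by-step
    # reasoning that ends in a parseable final answer.
    lines = [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    return (
        "Answer the following multiple-choice question. Think step by step, "
        'then finish with "The answer is (X)".\n\n'
        + question + "\n" + "\n".join(lines)
    )

def extract_answer(completion: str) -> str | None:
    # Pull the final letter from the completion. With 10 options a random
    # guess scores 1/10 = 10%, versus 1/4 = 25% in 4-option MMLU.
    m = re.search(r"answer is \(?([A-J])\)?", completion)
    return m.group(1) if m else None
```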
MMLU-Pro.torrent
  • MMLU-Pro/
    • README.md
      2.88 KB
    • README.txt
      5.75 KB
    • data/
      • MMLU-Pro.zip
        3.48 MB