Massive Multitask Language Understanding (MMLU)

Massive Multitask Language Understanding (MMLU) is a comprehensive benchmark that aims to measure the multi-task accuracy of text models by evaluating them in zero-shot and few-shot settings. It was introduced in the paper "Measuring Massive Multitask Language Understanding," published at ICLR 2021.

MMLU provides a way to test and compare a variety of language models, such as OpenAI GPT-4, Mistral 7B, Google Gemini, and Anthropic Claude 2. It covers 57 tasks spanning subjects from elementary mathematics to US history, computer science, and law, requiring models to demonstrate both a broad knowledge base and problem-solving ability.

Key details of the MMLU benchmark

  • Dataset splits: The dataset contains 15,908 questions, divided into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set (which can be used to select hyperparameters) consists of 1,540 questions, and the test set has 14,079 questions (the sketch after this list loads these splits and builds a 5-shot prompt).
  • Model performance: Early results on MMLU showed that smaller language models score around random-chance accuracy (25% for four-option multiple choice), while the much larger GPT-3 (175 billion parameters) did better, with 43.9% few-shot and 37.7% zero-shot accuracy. By 2023, GPT-4 reached 86.4% 5-shot accuracy and Google Gemini reached 83.7%. Even so, the best models still fall short of estimated human expert-level accuracy (89.8%).
  • Challenging subjects: Models perform comparatively poorly on calculation-heavy subjects (such as physics and mathematics) and on humanities topics such as ethics and law.
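
As a concrete illustration of how these splits are typically used, the following sketch loads one subject from a public Hugging Face mirror of MMLU and assembles a 5-shot prompt from its dev split. The `cais/mmlu` dataset id, its split names, and its field names are assumptions about that mirror rather than part of the original release.

```python
# Minimal sketch: inspect the MMLU splits for one subject and build a
# 5-shot prompt. Assumes the "cais/mmlu" mirror on the Hugging Face Hub
# and its field names ("question", "choices", "answer"); adjust if your
# copy of the data differs.
from datasets import load_dataset

SUBJECT = "college_physics"  # any of the 57 subject configurations
mmlu = load_dataset("cais/mmlu", SUBJECT)

# The dev split holds the 5 few-shot examples for this subject.
print({split: len(mmlu[split]) for split in mmlu})

LABELS = "ABCD"

def format_item(item, include_answer=True):
    """Render one multiple-choice item in the usual MMLU prompt layout."""
    lines = [item["question"]]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, item["choices"])]
    answer = LABELS[item["answer"]] if include_answer else ""
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def five_shot_prompt(dev_split, test_item):
    """Prepend the 5 answered dev questions to an unanswered test question."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {SUBJECT.replace('_', ' ')}.\n\n")
    shots = "\n\n".join(format_item(ex) for ex in dev_split)
    return header + shots + "\n\n" + format_item(test_item, include_answer=False)

print(five_shot_prompt(mmlu["dev"], mmlu["test"][0]))
```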

Key Features of the MMLU Benchmark

The MMLU benchmark measures the performance of language models on a variety of tasks, covering subjects across STEM, the humanities, and the social sciences. Some of its key features include:

  • 57 subjects: The benchmark covers 57 subjects, ranging from basic mathematics to advanced professional material in areas such as law and ethics (the subject list can be enumerated programmatically, as in the sketch after this list).
  • Granularity and breadth: MMLU tests both world knowledge and problem-solving skills, making it well suited to probing a model’s understanding of a wide variety of topics.
  • Multi-task accuracy: The test measures the model’s multi-task accuracy by covering a diverse range of tasks, giving a comprehensive evaluation of its academic and professional knowledge.
  • No need for a large training set: Unlike some other benchmarks, MMLU does not provide a large task-specific training set. Instead, it assumes the model has already acquired the necessary knowledge during pre-training on large and diverse text corpora.
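
As a rough way to see the full subject list, the sketch below asks the Hugging Face `datasets` library for the configuration names of the assumed `cais/mmlu` mirror; the extra non-subject configurations filtered out here are likewise an assumption about that mirror.

```python
# Minimal sketch: list the 57 MMLU subjects via the assumed "cais/mmlu"
# mirror on the Hugging Face Hub (the official release instead ships one
# CSV file per subject).
from datasets import get_dataset_config_names

configs = get_dataset_config_names("cais/mmlu")
# Drop the aggregate configurations the mirror adds on top of the subjects.
subjects = sorted(c for c in configs if c not in {"all", "auxiliary_train"})

print(len(subjects))   # expected: 57
print(subjects[:5])    # e.g. ['abstract_algebra', 'anatomy', 'astronomy', ...]
```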

These key features make the MMLU benchmark a valuable tool for evaluating the performance of language models and their ability to understand and generate language in a variety of contexts.

How MMLU works

The MMLU benchmark works by posing multiple-choice questions drawn from its 57 subjects. For each question, the model sees the question and four answer options, optionally preceded by up to five worked examples in the few-shot setting, and must pick the correct option; accuracy is then computed for each task.

The final MMLU score is the average of the scores obtained in each task, providing a comprehensive measure of the overall performance of the model.
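
A minimal sketch of that scoring scheme, assuming predictions and gold answers have already been collected per subject (the data structure, subject names, and values below are purely illustrative):

```python
# Minimal sketch of MMLU scoring: accuracy per subject, then the
# unweighted mean over subjects gives the headline MMLU score.
from statistics import mean

# subject -> list of (predicted_choice, gold_choice) pairs; illustrative only.
results = {
    "college_physics":  [("A", "A"), ("C", "B"), ("D", "D")],
    "us_history":       [("B", "B"), ("B", "B")],
    "professional_law": [("C", "D"), ("A", "A"), ("B", "C"), ("D", "D")],
}

per_subject_accuracy = {
    subject: mean(pred == gold for pred, gold in pairs)
    for subject, pairs in results.items()
}

mmlu_score = mean(per_subject_accuracy.values())  # macro-average across tasks

for subject, acc in per_subject_accuracy.items():
    print(f"{subject}: {acc:.2%}")
print(f"MMLU (macro-average): {mmlu_score:.2%}")
```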

MMLU Advantages

There are many benefits to the MMLU benchmark, the three most important of which are:

  1. It provides a quantitative way to compare the performance of different language models.
  2. It is computationally efficient and easy to understand.
  3. It considers the model’s ability to understand and generate language in various contexts and can capture certain aspects of language structure.

Limitations of MMLU

The MMLU benchmark also has some issues that make it a suboptimal benchmark:

  1. Missing context in questions: Some questions in the MMLU benchmark lack the context needed to answer them correctly, likely due to copy-paste errors when the questions were collected.
  2. Ambiguous answer sets: Some questions have ambiguous answer options, which can lead to confusion and an inaccurate assessment of model performance.
  3. Incorrect answer sets: Some questions have incorrect reference answers, which can produce misleading evaluation results.
  4. Sensitivity to prompts: MMLU is highly sensitive to the exact prompt format used, so reported performance can vary significantly from one prompt template to another (see the sketch below).
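
To make the prompt-sensitivity point concrete, the short sketch below renders the same made-up question under two slightly different templates; in practice, reported MMLU accuracy can shift noticeably between such formats.

```python
# Minimal sketch: the same multiple-choice item under two prompt templates.
# The question is invented for illustration; the point is only that small
# formatting changes like these can move a model's measured accuracy.
question = "Which law relates force, mass, and acceleration?"
choices = ["Ohm's law", "Newton's second law", "Hooke's law", "Boyle's law"]

template_a = "\n".join(
    [question]
    + [f"{label}. {choice}" for label, choice in zip("ABCD", choices)]
    + ["Answer:"]
)

template_b = "\n".join(
    [f"Question: {question}"]
    + [f"({label}) {choice}" for label, choice in zip("ABCD", choices)]
    + ["The correct option is"]
)

print(template_a, template_b, sep="\n\n---\n\n")
```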
