
GAIA General AI Assistant Benchmark Dataset

Date

10 months ago

Organization

Hugging Face
Meta

Publish URL

huggingface.co

GAIA was jointly launched by Meta, Hugging Face, and AutoGPT in the 2023 paper "GAIA: a Benchmark for General AI Assistants" and is one of the most comprehensive benchmarks for intelligent agents.

GAIA consists of more than 450 complex questions with clear, unambiguous answers that require different levels of tool use and autonomy to solve. The questions are divided into 3 levels: Level 1 can be solved by very capable LLMs, while Level 3 indicates a significant leap in model ability. Each level is split into a fully public development set for validation and a test set whose answers and metadata are kept private.

Questions are contained in metadata.jsonl. Some questions come with an additional file, which can be found in the same folder and whose name is given in the file_name field. More details are given in the paper.
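A minimal sketch of reading the questions and resolving optional attachments. The file name metadata.jsonl and the file_name field come from the description above; the other field names (task_id, Question, Level) are assumptions about the record layout, not guaranteed by this page:

```python
import json
from pathlib import Path

def load_gaia_metadata(path):
    """Read GAIA questions from metadata.jsonl (one JSON object per line)."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                tasks.append(json.loads(line))
    return tasks

def attachment_path(task, root):
    """Resolve a task's optional extra file via its file_name field.

    Returns None when the task has no attachment (empty or missing field).
    """
    name = task.get("file_name", "")
    return Path(root) / name if name else None
```

Because attachments live in the same folder as metadata.jsonl, passing that folder as `root` is enough to open whatever image, spreadsheet, or audio file a question references.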

Here is an example of a tricky problem:

Which of the fruits shown in the 2008 painting "Embroidery of Uzbekistan" were part of the breakfast menu on the ocean liner in October 1949, which was later used as a floating prop in the film "The Last Voyage"? Give the fruits as a comma-separated list, in clockwise order from their arrangement in the painting, starting at the 12 o'clock position. Use the plural form of each fruit.

This problem illustrates several difficulties:

  • The answer must follow a constrained format.
  • Multimodal capabilities are needed to read the fruits from the image.
  • There are multiple pieces of information that need to be collected, some of which depend on other information:
    • Fruits in pictures
    • The identity of the ocean liner used as a floating prop in The Last Voyage
    • That ocean liner's breakfast menu in October 1949
  • Together, these force the correct solution path through several chained steps.
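GAIA grades answers by exact match against the reference after light normalization. For a comma-separated list answer like the one demanded above, the check can be sketched as follows; the normalization rules here (lowercasing, trimming, stripping trailing periods) are simplified assumptions, not the official scorer:

```python
def normalize_item(s):
    """Lowercase, trim whitespace, and drop a trailing period from one element."""
    return s.strip().lower().rstrip(".")

def list_answers_match(predicted, reference):
    """Compare two comma-separated answers element by element.

    The comparison is order-sensitive because the example question demands
    clockwise order starting at the 12 o'clock position.
    """
    pred = [normalize_item(x) for x in predicted.split(",")]
    ref = [normalize_item(x) for x in reference.split(",")]
    return pred == ref
```

Order sensitivity is what makes the constrained format hard: a model that finds all the right fruits but lists them in the wrong order still scores zero.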

Solving this problem requires a high level of planning ability and strict execution, exactly two areas where LLMs struggle.

Therefore, GAIA is an excellent test set for intelligent agent systems. On the public GAIA leaderboard, GPT-4-Turbo averages less than 7%. The highest submission is an AutoGen-based solution that uses a complex multi-agent system and leverages OpenAI's tool-calling functionality, reaching 40%.