SimpleQA Concise Factual Question Answering Evaluation Dataset
SimpleQA is a factual-accuracy evaluation dataset for large language models, released by OpenAI in 2024 and described in the paper "Measuring short-form factuality in large language models." Its aim is to evaluate a model's correctness on short, clear, and uniquely verifiable factual questions, avoiding interference from complex reasoning or subjective judgment in the evaluation results.
The dataset has been updated and now contains 4,326 questions covering multiple topics, including science and technology, art, and entertainment. Of these, 4,321 form the official test set, and 5 are reserved for few-shot evaluation. Each question has a single, undisputed reference answer, verified against reliable sources by two independent human trainers to ensure accuracy and verifiability. Each sample is also labeled with the question's topic, answer type (e.g., person, number, or location), and supporting links to facilitate accurate evaluation and result analysis.
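The record structure described above can be sketched as a small Python example. The field names (`question`, `answer`, `topic`, `answer_type`, `urls`) and the grading function are illustrative assumptions, not the official SimpleQA schema or grader (the benchmark itself uses an LLM-based grader with CORRECT / INCORRECT / NOT_ATTEMPTED verdicts):

```python
# Hypothetical SimpleQA-style record; field names are assumptions,
# not the official schema.
sample = {
    "question": "In what year was the Eiffel Tower completed?",
    "answer": "1889",
    "topic": "Art",            # labeled topic
    "answer_type": "Number",   # person / number / location / ...
    "urls": ["https://example.com/eiffel-tower"],  # supporting links
}

def grade(prediction: str, reference: str) -> str:
    """Naive stand-in grader: exact case-insensitive match is CORRECT,
    an empty prediction is NOT_ATTEMPTED, anything else is INCORRECT.
    SimpleQA's real evaluation uses an LLM grader; this is only a sketch."""
    pred = prediction.strip().lower()
    if not pred:
        return "NOT_ATTEMPTED"
    return "CORRECT" if pred == reference.strip().lower() else "INCORRECT"

print(grade("1889", sample["answer"]))  # CORRECT
print(grade("", sample["answer"]))      # NOT_ATTEMPTED
```

Because every question has a single undisputed answer, even a simple matcher like this can approximate scoring, though string matching misses valid paraphrases (e.g., "the year 1889"), which is why an LLM grader is used in practice.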
Compared to earlier factual benchmarks, SimpleQA is significantly more challenging: even current state-of-the-art models achieve clearly limited accuracy on it. It can therefore serve as a high-intensity testing tool for evaluating the factual reliability of models.