Test of Time (ToT): A Benchmark Dataset for Evaluating the Temporal Reasoning Capabilities of Large Language Models
Date
Size
Publish URL
License
CC BY 4.0
Categories
Test of Time, or ToT for short, is a benchmark released in 2024 by researchers at Google DeepMind specifically for evaluating the temporal reasoning capabilities of large language models. It examines the temporal understanding and temporal arithmetic abilities of LLMs along two independent dimensions, as described in the paper "Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning".
The ToT dataset is divided into three subsets: ToT-semantic contains 1,850 examples, ToT-arithmetic contains 2,800 examples, and ToT-semantic-large contains 46,480 examples, enabling measurement of temporal semantics and logic at a larger scale.
Data Format
The ToT-semantic and ToT-semantic-large datasets contain the following fields:
- question: The text of the question.
- graph_gen_algorithm: The name of the graph generation algorithm used to create the facts.
- question_type: One of the seven question types in the dataset.
- sorting_type: The sorting order applied to the facts.
- prompt: The complete prompt text used to evaluate the LLM on the task.
- label: The ground-truth answer to the question.
The ToT-arithmetic dataset contains three fields: question, question_type, and label.
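Below is a minimal sketch of how records with these fields might be loaded and inspected. It assumes the examples are stored as JSON Lines; the file name `tot_semantic.jsonl` is hypothetical and should be replaced with the actual path of the downloaded subset.

```python
# Illustrative sketch: load ToT-semantic records and inspect the documented fields.
# Assumes a JSON Lines file; the file name below is a placeholder, not an official path.
import json

with open("tot_semantic.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

example = records[0]
print(example["question"])              # question text
print(example["graph_gen_algorithm"])   # graph generator used to create the facts
print(example["question_type"])         # one of the seven question types
print(example["sorting_type"])          # sorting order applied to the facts
print(example["label"])                 # ground-truth answer
# The full prompt shown to the LLM is available in example["prompt"].
# ToT-arithmetic records carry only "question", "question_type", and "label".
```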
Data Source
ToT is synthetically generated using public libraries such as NetworkX.
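As a rough illustration of this kind of synthesis, the sketch below generates a small random graph with NetworkX and attaches synthetic time intervals to its edges to produce temporal facts. The entity names, date ranges, and fact template are hypothetical and do not reproduce the authors' actual generation pipeline.

```python
# Illustrative sketch only: synthesizing temporal facts from a random graph.
# Entity labels, intervals, and the fact wording are assumptions for demonstration.
import random
import networkx as nx

random.seed(0)

# Each node stands in for an entity; edges stand in for relations between entities.
graph = nx.erdos_renyi_graph(n=6, p=0.4, seed=0)
entities = {i: f"E{i}" for i in graph.nodes}

# Attach a synthetic time interval to every relation.
facts = []
for u, v in graph.edges:
    start = random.randint(1900, 2000)
    end = start + random.randint(1, 20)
    facts.append(f"{entities[u]} was related to {entities[v]} from {start} to {end}.")

for fact in facts:
    print(fact)
```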
- Purpose: ToT is primarily designed to be used as a test set.
- Prohibited uses: Using ToT as a training set is strictly prohibited.