Michelangelo

Michelangelo is a method proposed by Google DeepMind researchers in 2024 for evaluating the long-context reasoning ability of large language models. It generates synthetic long-context evaluation tasks through a framework called Latent Structure Queries (LSQ). These tasks can be extended to arbitrary context lengths and configured at different complexity levels, while avoiding leakage of context from previous evaluations. The results were published in the paper "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries".

Michelangelo contains three minimal tasks: Latent List, Multi-Round Co-reference Resolution (MRCR), and IDK. These tasks are designed to test a model's ability to synthesize and reason over long contexts, going beyond simple information retrieval. The Latent List task requires the model to track the properties of an underlying list data structure across a sequence of code instructions; the MRCR task requires the model to understand ordering in natural text, distinguish between similar drafts, and reproduce a specified fragment of the context in response to a complex query; the IDK task tests whether the model can recognize that the information requested is not present in the given context.
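To make the Latent List idea concrete, the following is a minimal, hypothetical sketch of how such a synthetic task could be generated: a sequence of Python list operations is sampled, the ground-truth final state is computed alongside, and the model is asked to report that state. The function name, parameters, and operation mix are illustrative assumptions, not the benchmark's actual generator; the real tasks interleave relevant and irrelevant instructions at much longer context lengths.

```python
import random

def make_latent_list_task(n_ops=8, seed=0):
    """Build a toy Latent List-style prompt (illustrative sketch only).

    Returns the prompt text and the ground-truth final list, computed
    by tracking the latent list state alongside the emitted operations.
    """
    rng = random.Random(seed)  # seeded so the task is reproducible
    ops, state = [], []
    for _ in range(n_ops):
        # Occasionally pop; otherwise append a random small integer.
        if state and rng.random() < 0.3:
            ops.append("l.pop()")
            state.pop()
        else:
            v = rng.randint(0, 9)
            ops.append(f"l.append({v})")
            state.append(v)
    prompt = "l = []\n" + "\n".join(ops) + "\nWhat is the final value of l?"
    return prompt, state

prompt, answer = make_latent_list_task()
```

A grader can verify any generated instance by executing the emitted operations and comparing the result to the returned ground truth; extending the context is as simple as raising `n_ops` or padding with irrelevant instructions.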