HyperAI

A team of 49 mathematicians has released the Leipzig Benchmark, a dataset of 100 research-level mathematics questions with verified solutions, designed to evaluate the reasoning capabilities of large language models. The project was carried out at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany, and grew out of a three-day workshop held between April 1 and May 15, 2026. The questions span areas including algebraic geometry, combinatorics, and representation theory, and are intended as a rigorous standard for testing AI performance on complex formal problems.

The assessment used a three-stage evaluation protocol. In Stage 1, five state-of-the-art large language models each made a single attempt. 41 questions remained unsolved, which highlights how difficult single-pass reasoning is on research-grade tasks. Stage 2 ran three models 20 times each. This reduced the number of unsolved questions to 16, showing that repeated sampling and additional compute substantially improve solution rates. Stage 3 evaluated two heavy-thinking models with a three-run protocol, leaving only two questions unsolved.

The drop in unsolved questions from 41 to 16 and then to 2 across the three stages underscores how quickly mathematical reasoning in artificial intelligence is maturing. The results indicate that heavy-thinking architectures and multi-run strategies are essential for handling advanced mathematical deduction. The Leipzig Benchmark gives the research community a standardized instrument for tracking progress in automated theorem proving and formal reasoning. The complete dataset, statistical tables, and model results are available for academic use.

Authors: Andrei Balakin, Miklós Bóna, Marie-Charlotte Brandenburg, Clara Briand, Veronica Calvo Cortes, Shelby Cox, Jesus A. De Loera, Danai Deligeorgaki, Hannah Friedman, Tim Gehrunger, Chiara Giardino, Stephen Griffeth, Baran Hashemi, Elena Hoster, Alexander Ivanov, Nupur Jain, Aryaman Jal, Leonie Kayser, Joris Koefler, Kevin Kühn, Mario Kummer, Felix Lotter, René Marczinzik, Victor S. Miller, Alejandro Morales, Greta Panova, Gianni Petrella, Nathan Pflueger, Lakshmi Ramesh, Nikolas Rieke, Carlos Rodriguez, Andrea Rosana, Flavio Salizzoni, Otto T.P. Schmidt, Sven Ulf Schmitz, Lina Maria Simbaqueba Marin, Luca Sodomaco, Christian Stump, Bernd Sturmfels, Alexander Taveira Blomenhofer, Simon Telen, Philipp Tuchel, Emil Verkama, Carl Felix Waller, Julian Weigert, Annette Werner, Nathan Williams, Claudius Zibrowius.

Source: arXiv:2606.05818 [math.HO].
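
For readers who want the stage results as solve rates, the reported unsolved counts (41, 16, and 2 out of 100 questions) can be converted with a few lines of Python. This is only an arithmetic sketch based on the figures quoted in the summary; the names `TOTAL_QUESTIONS`, `UNSOLVED_BY_STAGE`, `solved_counts`, and `drop_in_unsolved` are illustrative and do not come from the benchmark's own materials.

```python
# Figures taken from the summary: 100 questions, and the number left
# unsolved after each stage of the evaluation protocol.
TOTAL_QUESTIONS = 100
UNSOLVED_BY_STAGE = {
    "Stage 1 (five models, one attempt each)": 41,
    "Stage 2 (three models, 20 runs each)": 16,
    "Stage 3 (two heavy-thinking models, three runs each)": 2,
}


def solved_counts(total, unsolved_by_stage):
    """Return the number of solved questions for each stage."""
    return {stage: total - unsolved for stage, unsolved in unsolved_by_stage.items()}


def drop_in_unsolved(unsolved_by_stage):
    """Return how much the unsolved count fell between consecutive stages."""
    counts = list(unsolved_by_stage.values())
    return [prev - cur for prev, cur in zip(counts, counts[1:])]


if __name__ == "__main__":
    for stage, solved in solved_counts(TOTAL_QUESTIONS, UNSOLVED_BY_STAGE).items():
        print(f"{stage}: {solved}/{TOTAL_QUESTIONS} solved")
    print("Drop in unsolved between stages:", drop_in_unsolved(UNSOLVED_BY_STAGE))
```

Under these figures, the stages correspond to 59, 84, and 98 solved questions, with the unsolved count falling by 25 and then by 14.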
