Command Palette
Search for a command to run...
LEAP: Beschleunigung von LLMs für formale Mathematik mit agentic frameworks
LEAP: Beschleunigung von LLMs für formale Mathematik mit agentic frameworks
Zusammenfassung
Große Sprachmodelle (LLMs) zeigen zwar starke Fähigkeiten im informellen mathematischen Reasoning, tun sich jedoch schwer bei der Generierung mechanisch überprüfbarer Beweise in formalen Sprachen wie Lean. Wir präsentieren LEAP (LLM-in-Lean Environment Agentic Prover), ein agentic Framework, das general-purpose foundation models in die Lage versetzt, state-of-the-art Ergebnisse im automatisierten formalen Theorembeweis zu erzielen. LEAP nutzt die Stärken von foundation models, wie informelles Reasoning, Befolgung von Instruktionen und iterative Selbstverfeinerung. Durch die Zerlegung komplexer Probleme in kleinere Einheiten verbindet das System den Aufbau formaler Beweise mit informellen Entwurfsskizzen durch kontinuierliche Interaktion mit dem Lean-Compiler.Um eine rigorose Bewertung jenseits zunehmend gesättigter Benchmarks zu ermöglichen, führen wir Lean-IMO-Bench ein: einen Benchmark mit IMO-ähnlichen Problemen, die in Lean formalisiert wurden. Diese zeichnet sich durch knappe Aussageformulierungen aus, erfordert jedoch hochgradig unübliche und mehrstufige Beweise über ein breites Spektrum an Schwierigkeitsstufen hinweg.Empirisch zeigt LEAP auf dem Putnam-Wettbewerb 2025, einem jährlichen Mathematikwettbewerb für Studierende in Nordamerika, dass alle 12 Aufgaben gelöst werden konnten – ein Ergebnis, das mit den jüngsten Durchbrüchen von fortschrittlichen formalen mathematischen Modellen mithält. Auf Lean-IMO-Bench steigert LEAP die one-shot Formal-Lösungsrate general-purpose LLMs von unter 10 % auf 70 % und übertrifft damit den von einem spezialisierten IMO-System auf Goldmedaillenniveau gesetzten Benchmark von 48 % deutlich.Zudem demonstrieren wir den nutzerseitigen Forschungsanspruch von LEAP, indem es autonome Formalisierungen komplexer Beweise für offene kombinatorische Fragestellungen vorantreibt, einschließlich eines verifizierten Beweises für ein Schlüsselunterproblem in Knuths Hamiltonscher Zerlegung von Cayley-Graphen gerader Ordnung.
One-sentence Summary
LEAP (LLM-in-Lean Environment Agentic Prover) enables general-purpose foundation models to achieve state-of-the-art performance in automated formal theorem proving by bridging formal proof construction with informal blueprints through continuous interaction with the Lean compiler, solving all 12 problems in the 2025 Putnam Competition and boosting one-shot formal solve rates from below 10% to 70% on the Lean-IMO-Bench while autonomously formalizing complex proofs for open combinatorial challenges including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.
Key Contributions
- The paper introduces LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance in automated formal theorem proving. This method decomposes complex problems into smaller units and bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler.
- To provide rigorous evaluation beyond saturated benchmarks, the work presents Lean-IMO-Bench, a collection of IMO-style problems formalized in Lean with highly non-routine and multi-step proofs. This dataset targets elementary statements requiring structurally intricate solutions across a wide range of difficulty levels.
- Experimental results demonstrate that LEAP solves all 12 problems on the 2025 Putnam Competition and increases the one-shot formal solve rate on Lean-IMO-Bench from below 10% to 70%. These findings indicate that structured, iterative interaction with the proof environment is more critical than formal language comprehension alone for general-purpose LLMs.
Introduction
Large Language Models excel at informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. This limitation matters because natural language solutions often contain hallucinations that are difficult to verify, whereas formal languages provide automated verification with guaranteed accuracy. Existing approaches typically rely on specialized models fine-tuned on formal corpora or hybrid systems where general models only handle informal reasoning. To address this, the authors introduce LEAP, an agentic framework that enables general-purpose foundation models to achieve leading performance in automated formal theorem proving without specialized fine-tuning. LEAP decomposes complex problems into smaller units and iteratively refines proofs using compiler feedback to bridge informal blueprints with formal construction. Additionally, the team releases Lean-IMO-Bench to evaluate progress on challenging multi-step problems beyond saturated benchmarks.
Dataset
- Dataset Composition and Source: The authors introduce LEAN-IMO-BENCH, a curated collection of 60 problems building upon the IMO-ProofBench suite from Luong et al. (2025).
- Subset Details: The benchmark divides evenly into a Basic set and an Advanced set of 30 problems each. The Basic set spans pre-IMO to IMO-Medium difficulty with 8 algebra, 8 combinatorics, 8 number theory, and 6 geometry problems. The Advanced set includes novel problems up to IMO-Hard difficulty featuring 8 algebra, 8 combinatorics, 6 number theory, and 8 geometry problems.
- Usage in Paper: The team utilizes this data primarily for the Formal Theorem Proving task to evaluate model performance, noting that natural language reasoning does not guarantee success in formal code generation.
- Processing and Verification: Lean experts manually formalized and verified every problem statement to ensure high accuracy. The corresponding Lean solutions are kept concise to avoid overhead from complex modern mathematical theories while relying on elementary mathematical backgrounds.
Method
The authors introduce LEAP, an agentic framework for automated theorem proving that leverages hierarchical decomposition and planning to bridge the gap between general LLMs and formal verification. Rather than synthesizing a complete proof in a single pass, LEAP incrementally drafts blueprints, decomposes Lean goals into supporting lemmas, and maintains the evolving proof plan as an AND-OR DAG.
The overall workflow is depicted in the framework diagram below.
Given an input theorem, the system registers the Lean statement as the root goal, represented as an OR node in the AND-OR DAG. A State Reader retrieves the current goal's statement, dependencies, and related lemmas. LEAP first attempts a direct proof strategy. This involves an NL Prover generating an informal argument, which a Formal Prover translates into Lean code. A Verifier checks the candidate against the Lean compiler. If the proof fails, a Reviser uses feedback to attempt a rewrite.
If direct proving fails, the system shifts to a decomposition strategy known as Blueprint Generation. The agent drafts an informal blueprint proposing intermediate lemmas, then translates this plan into a Lean proof sketch. Crucially, the sketch proves the current goal assuming only the proposed lemmas are true: the main theorem body is sorry-free, while sorry placeholders are permitted only in the newly proposed lemma statements. This sketch is validated by a Verifier (Lean) and a Reviewer (LLM). The Reviewer assesses whether the subgoals are relevant and make the problem easier, acting as a search filter to prevent the agent from expanding weak blueprints.
When a sketch is accepted, it is added as an AND node, and the proposed lemmas are added as child OR nodes. This ensures that once all child subgoals are proved, the parent goal is also proved. The State Writer then checks that the update preserves acyclicity before committing it to the DAG. The agent then recursively processes the newly created subgoals.
The dependency structure of the proof plan is illustrated in the DAG example below.
This structure supports two central advantages. First, monotone refinement allows the search to focus on expanding descendants without restructuring the established dependency order. Second, lemma memoization stores intermediate lemma statements as shared nodes, allowing them to be reused whenever the same subproblem arises in different branches. This also supports anticipatory lemma planning, where auxiliary lemmas not immediately required by the current sketch are proposed to support later proof steps.
The method relies on three tightly coupled design choices. Hierarchical memoization via the DAG preserves progress and reuses lemmas across branches. Interleaved informal-formal planning connects natural-language strategies with executable Lean code, ensuring that each formal attempt is paired with an informal rationale. Finally, verification-guided proof search uses compiler feedback and LLM-based review to accept, revise, decompose, or abandon candidate branches, effectively guiding the search through the formal proof space.
Experiment
The evaluation assesses the LEAP framework against multiple baselines on the Putnam 2025 and Lean-IMO-Bench datasets to validate its formal proving capabilities. Results indicate that LEAP significantly outperforms specialized and general models by resolving complex search bottlenecks and achieving complete success on challenging benchmarks where others fail. Ablation studies further demonstrate that the system's DAG-based memory and LLM-guided review mechanisms are critical for reducing redundant derivations and pruning unproductive proof branches, while case studies on open combinatorial problems confirm the framework's ability to autonomously synthesize rigorous proofs for intricate mathematical tasks.
The study compares one-shot and iterative formalization performance between a specialized automated theorem prover and a general foundation model. The general model demonstrates significant improvement when allowed iterative revision steps, whereas the specialized model shows no benefit from the feedback loop. This highlights the importance of general reasoning capabilities for handling the complex context required in iterative proof generation. The general foundation model outperforms the specialized prover in both one-shot and iterative evaluation settings. Iterative revision steps lead to substantial performance gains for the general model, while the specialized model fails to benefit from the feedback loop. These findings suggest that interpreting compiler errors and maintaining context are capabilities where general models excel over specialized provers.
The authors compare a full DAG architecture against a naive tree variant to evaluate the impact of global lemma sharing on formal proof generation. The results indicate that the Full DAG configuration consistently outperforms the baseline, particularly in advanced algebra and number theory tasks where it achieves perfect solve rates. This suggests that maintaining a global proof memory significantly enhances the system's ability to handle complex mathematical reasoning. The Full DAG configuration achieves perfect performance in Algebra and Number Theory across both difficulty tiers, whereas the tree baseline struggles with advanced Number Theory. Overall performance shows a substantial improvement with the Full DAG setup, especially in the Advanced set where the tree baseline performs significantly worse. While Geometry remains the most challenging category, the Full DAG configuration manages to solve some problems where the Naive Tree baseline fails completely.
The data presents baseline performance of Gemini models on the Lean-IMO-Bench across three evaluation tasks: natural language proofing, formal theorem proving, and formal proof translation. While natural language proof generation achieves moderate success, the models struggle significantly with generating or translating formal Lean proofs. Performance consistently degrades when moving from the Basic set to the Advanced set, highlighting the difficulty current foundation models face with complex formal verification without specialized agentic frameworks. Natural language proof generation outperforms formal theorem proving and proof translation tasks. Performance drops significantly on the Advanced set compared to the Basic set for all evaluated tasks. Average success rates for formal tasks remain very low even with extensive sampling attempts.
The the the table compares the problem-solving capabilities of the LEAP framework against various baseline methods on the Putnam 2025 benchmark. While direct formalization baselines fail to solve any problems, the proposed LEAP framework achieves a perfect solve rate across the entire dataset. This performance significantly outperforms other systems, including the specialized Aristotle model which solves a majority but not all problems. Direct formalization baselines like Gemini-3.1-pro and Goedel-Prover-V2-32B fail to solve any of the benchmark problems. The LEAP framework successfully solves all problems in the benchmark, achieving a higher success rate than all other compared methods. Specialized systems like Aristotle and Hilbert show partial success but fall short of the complete coverage achieved by LEAP.
The evaluation on the Lean-IMO-Bench compares the proposed LEAP framework against direct formalization and agentic baselines across Basic and Advanced difficulty tiers. Results indicate that while baseline methods struggle significantly with increased complexity and specific domains like Geometry, LEAP demonstrates superior robustness and generalization. The proposed method achieves the highest overall performance and maintains perfect solve rates in Algebra and Number Theory regardless of difficulty. LEAP achieves the highest overall solve rates on the benchmark, outperforming specialized and general-purpose baselines across both difficulty tiers. Baseline models exhibit significant performance drops on the Advanced set, whereas LEAP maintains strong consistency, including perfect scores in Algebra and Number Theory. Geometry proves to be a difficult category for all methods, with performance remaining negligible across the board.
Comparative evaluations indicate that general foundation models benefit significantly from iterative revision steps and outperform specialized provers in handling complex reasoning contexts. Architectural analysis reveals that a Full DAG configuration with global lemma sharing substantially enhances performance on advanced algebra and number theory tasks compared to naive tree baselines. While standard models struggle with formal proof generation on difficult benchmarks, the proposed LEAP framework demonstrates superior robustness by achieving perfect solve rates across multiple evaluation sets.