LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
Abstract
We present LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances native formal reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities: auto-formalization, sketching, and proving. To foster these capabilities, we devise a hybrid-experts iteration framework that scales up high-quality task trajectories. Concretely, given an informal problem, the framework generates formal statements, derives complete proofs directly from statements, or produces lemma-style sketches. For the agentic reinforcement learning (RL) stage, we present the Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which stabilizes MoE model training on long-horizon tasks. The algorithm adopts gradient-masking strategies at both the sequence and token level to account for policy staleness and the inherent divergence between training and inference engines, and further incorporates theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover establishes new state-of-the-art performance among open-weights models in both auto-formalization and theorem proving. It is remarkably sample-efficient, achieving a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 samples per problem. On harder benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with at most 220 attempts per problem, substantially outperforming existing open-weights baselines.
One-sentence Summary
The Meituan LongCat Team introduces LongCat-Flash-Prover, a 560-billion-parameter MoE model that advances Lean4 formal reasoning via a Hybrid-Experts Iteration Framework and HisPO algorithm. This approach stabilizes training on long-horizon tasks while achieving state-of-the-art performance in auto-formalization and theorem proving with exceptional sample efficiency.
Key Contributions
- The paper introduces LongCat-Flash-Prover, a 560-billion-parameter open-source Mixture-of-Experts model that advances native formal reasoning in Lean4 by decomposing tasks into auto-formalization, sketching, and proving capabilities enhanced through agentic tool-integrated reasoning.
- A Hybrid-Experts Iteration Framework is presented to generate high-quality task trajectories by employing specialized expert models that iteratively refine their outputs using verifiable formal tools for compilation and verification feedback.
- The work proposes the Hierarchical Importance Sampling Policy Optimization algorithm to stabilize Mixture-of-Experts training on long-horizon tasks by applying gradient masking for policy staleness and incorporating consistency detection mechanisms to prevent reward hacking, achieving state-of-the-art results on benchmarks like MiniF2F-Test and PutnamBench.
Introduction
Formal theorem proving in languages like Lean4 is critical for ensuring mathematical rigor and reliable AI reasoning, yet current LLMs struggle to bridge the gap between informal natural language problems and verified formal proofs. Prior approaches often treat auto-formalization and proving as separate tasks or rely on vanilla tool-integrated reasoning that fails to handle the strict logical progression and verification feedback loops required by formal systems. To address these challenges, the authors introduce LongCat-Flash-Prover, a 560-billion-parameter MoE model that unifies native formal reasoning through a Hybrid-Experts Iteration Framework and a novel Hierarchical Importance Sampling Policy Optimization algorithm. This approach decomposes reasoning into auto-formalization, sketching, and proving capabilities while stabilizing long-horizon RL training and preventing reward hacking through legality detection, ultimately setting new state-of-the-art benchmarks for open-weights models.
Dataset
- Dataset Composition and Sources: The authors curate training data from two primary perspectives: mathematics queries for native formal reasoning and general-purpose queries for informal reasoning. The informal reasoning component includes conversation, STEM-like reasoning, coding, agentic tool use, searching, knowledge, and safety data sampled from the LongCat-Flash-Thinking-2601 cold-start set. The formal reasoning component combines open-source datasets and an in-house corpus, further enhanced by a self-synthesis framework that transforms informal tasks into formal ones.
- Key Details for Synthesized Subsets: The hybrid-experts framework generates six distinct trajectory sets through a curriculum learning approach that progresses from single-turn to multi-turn tool interactions:
- Auto-Formalization Sets: Includes single-turn trajectories (D_af) where formal statements pass validation without tool feedback, and multi-turn trajectories (D_af') where the model iterates on tool feedback until verification succeeds.
- Whole-Proof Sets: Contains single-turn whole proofs (D_whole,pf) and multi-turn whole proofs with tool interaction (D_whole,pf'), both requiring the final proof to pass synthesis and lemma checks.
- Sketching Sets: Comprises lemma-style sketches (D_sk') generated when whole proofs fail, and the corresponding lemma-style proofs (D_sk,pf') that complete the decomposition.
- Filtering Rules: All subsets retain only verified trajectories where the validation metrics (e.g., V_syn, V_con, V_leg) return positive results such as "PASS" or "SORRY", together with consistency checks.
- Data Usage and Training Strategy: The authors employ a progressive synthesis strategy in which single-turn trajectories represent simpler tasks and tool-interaction trajectories represent harder ones.
- Mixture and Splitting: Prompts associated with tool calls are further divided into subsets for cold-start training and Reinforcement Learning (RL) training.
- Difficulty Filtering: Prompts with a difficulty score of 0 are retained for future synthesis cycles, while those with a score of 1 for two consecutive iterations are removed to improve efficiency.
- Training Phases: The cold-start phase mixes generic data to preserve informal reasoning capabilities, while subsequent iterations focus on the synthesized formal trajectories.
- Processing and Metadata Construction:
- Basic Processing: The pipeline includes semantic deduplication, desaturation, and quality assurance. Formal statements generated during iterations are filtered to remove duplicates and prevent data leakage from the test set.
- Difficulty Estimation: A metric calculates difficulty based on the pass rate of N synthesized trajectories per prompt, allowing the system to dynamically select challenging data for training.
- Diversity Sampling: To prevent overfitting, each prompt is restricted to a single trajectory. The authors use a weighted sampling scheme that evaluates average trajectory length and tool call frequency, preferentially selecting responses with shorter lengths or fewer tool calls to avoid excessive reasoning.
- Benchmark Integration: Evaluation utilizes diverse benchmarks including Combibench, FormalMath-lite, MathOlympiad-Bench, MiniF2F, ProofNet, ProverBench, and PutnamBench, all featuring natural language problems paired with formal counterparts.
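The difficulty estimation and prompt-retention rules above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the mapping from pass rate to score are assumptions.

```python
from typing import Dict, List


def difficulty_score(pass_flags: List[bool]) -> float:
    """Estimate a prompt's score from the pass rate of its N synthesized
    trajectories (an assumption: score = fraction of trajectories that
    verified, so 0 means unsolved and 1 means always solved)."""
    return sum(pass_flags) / len(pass_flags)


def select_prompts(score_history: Dict[str, List[float]]) -> Dict[str, List[str]]:
    """Apply the retention rules stated in the text:
    - score 0 -> retained for future synthesis cycles,
    - score 1 in two consecutive iterations -> removed for efficiency,
    - everything else -> kept in the current training mix."""
    future, train, removed = [], [], []
    for prompt, scores in score_history.items():
        if scores[-1] == 0.0:
            future.append(prompt)
        elif len(scores) >= 2 and scores[-1] == 1.0 and scores[-2] == 1.0:
            removed.append(prompt)
        else:
            train.append(prompt)
    return {"future": future, "train": train, "removed": removed}
```

Keeping per-iteration score histories rather than a single score is what makes the "two consecutive iterations" removal rule expressible.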
Method
The authors introduce a Hybrid-Experts Iteration Framework designed to synthesize verified trajectories and continuously train professional native formal reasoning expert models. This approach decomposes native formal reasoning into three main capabilities: auto-formalization, sketching, and proving. Each capability corresponds to a specialized expert model, formally represented as π_θ^af, π_θ^sk, and π_θ^pf, respectively. To ensure robustness, the system maintains a dedicated tool set for each expert to support Tool-Integrated Reasoning (TIR) trajectory synthesis and rejection sampling.
The Auto-Formalization expert transforms natural language problem statements into formal Lean4 statements. To address potential syntax errors or semantic inconsistencies, the authors employ tools for statement syntax detection and semantic consistency detection. The Sketching expert generates lemma-style sketches, breaking down complex theorems into manageable helper lemmas using a Divide and Conquer strategy. Finally, the Proving expert handles theorem proving through two schemas: Whole-Proof Generation and Sketch-Proof Generation. These processes utilize verifiers for syntax and legality to ensure the generated proofs are valid and consistent with the formal statements.
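The multi-turn loop between an expert and its tools can be sketched as below, using the auto-formalization expert as the example. The callables are hypothetical stand-ins for the expert model and the syntax/consistency detection tools; only trajectories that verify are kept, matching the rejection-sampling step described above.

```python
from typing import Callable, List, Optional, Tuple

def tir_autoformalize(
    informal: str,
    generate: Callable[[str, List[str]], str],        # stand-in for the expert model
    check_syntax: Callable[[str], Tuple[bool, str]],  # stand-in for syntax detection
    check_consistency: Callable[[str, str], bool],    # stand-in for semantic consistency
    max_turns: int = 4,
) -> Optional[List[Tuple[str, str]]]:
    """Multi-turn tool-integrated auto-formalization (a sketch).

    The model proposes a formal statement, tools return feedback, and the
    model retries until verification succeeds or the turn budget runs out.
    Returns the trajectory only if it verified; otherwise None (rejected)."""
    feedback: List[str] = []
    trajectory: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        statement = generate(informal, feedback)
        ok, msg = check_syntax(statement)
        trajectory.append((statement, msg))
        if ok and check_consistency(informal, statement):
            return trajectory  # verified trajectory enters the training set
        feedback.append(msg)   # tool feedback drives the next turn
    return None                # rejected: never enters the training set
```

Single-turn trajectories are simply the case where the first proposal already verifies, which is why the curriculum can order them before multi-turn ones.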

The training process for the LongCat-Flash-Prover follows a structured pipeline organized into two main phases. The Cold-Start Phase utilizes an auto-formalization model and a thinking model to synthesize multiple formal statements and high-quality agentic trajectories. These trajectories are curated into a dataset through decontamination and sampling, then used to perform Domain-Mixed SFT on a LongCat Mid-train Base Model to obtain a Cold-start Model.
The Iteration Phase leverages the Cold-start Model as a new expert to refresh prompts and trajectories through self-distillation. This phase incorporates general reasoning data to maintain informal reasoning capabilities. The model is further trained using Domain-Mixed SFT and Agentic TIR RL. This iterative cycle allows the system to enrich trajectories and experts, progressively refining the model's performance until the final LongCat-Flash-Prover is reached.

To handle complex proofs, the system extends the trained model into a unified Judger-Sketcher-and-Prover capable of performing tree search in the lemma space. This method allows the agent to decompose complex goals via sketching, solve sub-goals via proving, and prune the search tree via judging. The proved lemmas are simplified as axioms in the context to compress agent memory usage.
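The tree search over the lemma space can be sketched as a small recursion. The `prove`, `sketch`, and `judge` callables are hypothetical stand-ins for the unified Judger-Sketcher-and-Prover's three roles; proved lemmas are appended to an `axioms` context, mirroring how the paper compresses memory by treating proved lemmas as axioms.

```python
from typing import Callable, List, Optional

def lemma_tree_search(
    goal: str,
    prove: Callable[[str, List[str]], Optional[str]],  # returns a proof or None
    sketch: Callable[[str], List[str]],                # decomposes a goal into lemmas
    judge: Callable[[str], bool],                      # decides whether to keep a branch
    axioms: Optional[List[str]] = None,
    depth: int = 3,
) -> Optional[str]:
    """Tree search in the lemma space (a sketch, not the paper's system)."""
    axioms = axioms if axioms is not None else []
    proof = prove(goal, axioms)            # try to close the goal directly
    if proof is not None:
        return proof
    if depth == 0:
        return None
    for lemma in sketch(goal):             # decompose the goal via sketching
        if not judge(lemma):               # prune unpromising branches via judging
            continue
        if lemma_tree_search(lemma, prove, sketch, judge, axioms, depth - 1):
            axioms.append(lemma)           # proved lemma becomes an axiom in context
    return prove(goal, axioms)             # retry with the enriched context
```

Sharing one `axioms` list across branches is the memory-compression step: sub-proofs are discarded and only the lemma statements remain in context.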

For the agentic RL component, the authors introduce a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm. This algorithm implements a hierarchical gradient masking strategy to estimate training-inference consistency at sequence-level and token-level granularity. This ensures stable optimization by mitigating issues such as distribution drift and policy staleness that arise in asynchronous training environments.
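The hierarchical gradient-masking idea can be sketched as follows. The threshold names and default values are assumptions, not the paper's; the point is the two granularities: mask individual tokens whose train/inference log-probability gap is large, and mask whole sequences whose aggregate gap signals policy staleness or engine mismatch.

```python
from typing import List

def hispo_mask(
    logp_train: List[float],   # per-token log-probs under the training engine
    logp_infer: List[float],   # per-token log-probs under the inference engine
    tok_eps: float = 1.0,      # token-level tolerance (assumed value)
    seq_eps: float = 0.5,      # sequence-level tolerance (assumed value)
) -> List[float]:
    """Hierarchical gradient mask in the spirit of HisPO (a sketch).

    Returns a per-token 0/1 mask: 1 keeps the token's gradient, 0 drops it.
    If the sequence-level discrepancy exceeds seq_eps, the whole sequence
    is masked out regardless of per-token values."""
    ratios = [a - b for a, b in zip(logp_train, logp_infer)]  # log importance ratios
    token_mask = [1.0 if abs(r) <= tok_eps else 0.0 for r in ratios]
    seq_gap = abs(sum(ratios) / len(ratios))                  # sequence-level drift
    seq_mask = 1.0 if seq_gap <= seq_eps else 0.0
    return [seq_mask * m for m in token_mask]
```

In a real trainer this mask would multiply the per-token policy-gradient terms, so stale or engine-inconsistent tokens contribute no gradient.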
Experiment
- Auto-formalization experiments validate that LongCat-Flash-Prover achieves state-of-the-art results across multiple benchmarks, with tool feedback significantly enhancing the ability to solve previously unsolvable tasks and demonstrating that general reasoning models can effectively transfer to specialized formalization tasks.
- Theorem proving evaluations confirm the model's superior efficacy in whole-proof and sketch-proof modes, establishing new records for open-source provers with high sample efficiency and showing that tree search strategies further improve accuracy by simplifying lemma decomposition.
- General informal reasoning tests indicate that while native formal reasoning training causes a slight performance drop on informal tasks, the model retains sufficient capability, suggesting an acceptable trade-off between formal and informal reasoning skills.
- Analysis of reward hacking reveals critical vulnerabilities in existing evaluation pipelines where models exploit syntax loopholes to generate fake proofs, and the implementation of AST-based verification successfully suppresses these cheating behaviors to ensure valid proof generation.
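The legality-detection idea behind the last point can be illustrated with a crude check. The paper describes AST-based verification; this sketch only does token matching on the proof text, and the forbidden-term list is an assumption, but it shows the kind of escape hatch a fake proof exploits.

```python
import re

# Lean4 constructs that close a goal without a genuine derivation
# (illustrative list; a real checker would walk the elaborated AST).
FORBIDDEN = {"sorry", "admit", "axiom"}

def legality_check(proof: str) -> bool:
    """Reject proofs that smuggle in escape hatches (a sketch of
    legality detection; not the paper's AST-based verifier)."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_']*", proof)
    return not any(t in FORBIDDEN for t in tokens)
```

Token matching is easy to fool (e.g., a forbidden name hidden behind a macro), which is precisely why the authors verify at the AST level instead of the surface syntax.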