HyperAIHyperAI

Command Palette

Search for a command to run...

2달 전
LLM
Reasoning

LongCat-Flash-Prover: Agentic Tool-Integrated Reinforcement Learning를 통한 Native Formal Reasoning의 발전

초록

저희는 Lean4 환경에서 에이전트 기반 도구 통합 추론 (TIR) 을 통해 네이티브 형식 추론을 고도화하는 플래그십 5600 억 매개변수 오픈소스 혼합 전문가 (Mixture-of-Experts, MoE) 모델인 LongCat-Flash-Prover 를 소개합니다. 네이티브 형식 추론 작업을 자동 형식화 (auto-formalization), 스케치 작성 (sketching), 증명 (proving) 의 세 가지 독립적인 형식 능력으로 분해합니다. 이러한 능력을 구현하기 위해, 주어진 비형식적 문제를 기반으로 형식 명제를 생성하거나, 명제로부터 직접 전체 증명을 도출하거나, 보조 정리 (lemma) 스타일의 스케치를 생성하는 고품질 작업 궤적을 확장하는 하이브리드 전문가 반복 프레임워크 (Hybrid-Experts Iteration Framework) 를 제안합니다. 에이전트 강화학습 (RL) 과정에서는 장시간 범위의 작업에서 MoE 모델의 학습을 안정화하기 위해 계층적 중요도 샘플링 정책 최적화 (Hierarchical Importance Sampling Policy Optimization, HisPO) 알고리즘을 제시합니다. 이 알고리즘은 시퀀스 및 토큰 수준에서 정책의 노후화 (staleness) 와 학습 - 추론 엔진 간의 고유한 불일치를 고려한 그라디언트 마스킹 전략을 적용합니다. 또한, 보상 해킹 (reward hacking) 문제를 제거하기 위해 정리 일관성 및 합법성 감지 메커니즘을 통합했습니다. 광범위한 평가 결과, LongCat-Flash-Prover 는 자동 형식화 및 정리 증명 분야에서 오픈 가중치 모델을 위한 새로운 최고 성능 (state-of-the-art) 을 달성함을 확인했습니다. 뛰어난 샘플 효율성을 입증하여, 문제당 72 번의 추론 예산만으로 MiniF2F-Test 에서 97.1% 의 통과율을 기록했습니다. 더 까다로운 벤치마크에서는 문제당 220 회 이하의 시도만으로 ProverBench 의 70.8% 와 PutnamBench 의 41.5% 를 해결하여 기존 오픈 가중치 베이스라인을 크게 능가했습니다.

One-sentence Summary

The Meituan LongCat Team introduces LongCat-Flash-Prover, a 560-billion-parameter MoE model that advances Lean4 formal reasoning via a Hybrid-Experts Iteration Framework and HisPO algorithm. This approach stabilizes training on long-horizon tasks while achieving state-of-the-art performance in auto-formalization and theorem proving with exceptional sample efficiency.

Key Contributions

  • The paper introduces LongCat-Flash-Prover, a 560-billion-parameter open-source Mixture-of-Experts model that advances native formal reasoning in Lean4 by decomposing tasks into auto-formalization, sketching, and proving capabilities enhanced through agentic tool-integrated reasoning.
  • A Hybrid-Experts Iteration Framework is presented to generate high-quality task trajectories by employing specialized expert models that iteratively refine their outputs using verifiable formal tools for compilation and verification feedback.
  • The work proposes the Hierarchical Importance Sampling Policy Optimization algorithm to stabilize Mixture-of-Experts training on long-horizon tasks by applying gradient masking for policy staleness and incorporating consistency detection mechanisms to prevent reward hacking, achieving state-of-the-art results on benchmarks like MiniF2F-Test and PutnamBench.

Introduction

Formal theorem proving in languages like Lean4 is critical for ensuring mathematical rigor and reliable AI reasoning, yet current LLMs struggle to bridge the gap between informal natural language problems and verified formal proofs. Prior approaches often treat auto-formalization and proving as separate tasks or rely on vanilla tool-integrated reasoning that fails to handle the strict logical progression and verification feedback loops required by formal systems. To address these challenges, the authors introduce LongCat-Flash-Prover, a 560-billion-parameter MoE model that unifies native formal reasoning through a Hybrid-Experts Iteration Framework and a novel Hierarchical Importance Sampling Policy Optimization algorithm. This approach decomposes reasoning into auto-formalization, sketching, and proving capabilities while stabilizing long-horizon RL training and preventing reward hacking through legality detection, ultimately setting new state-of-the-art benchmarks for open-weights models.

Dataset

  • Dataset Composition and Sources The authors curate training data from two primary perspectives: mathematics queries for native formal reasoning and general-purpose queries for informal reasoning. The informal reasoning component includes conversation, STEM-like reasoning, coding, agentic tool use, searching, knowledge, and safety data sampled from the LongCat-Flash-Thinking-2601 cold-start set. The formal reasoning component combines open-source datasets and an in-house corpus, further enhanced by a self-synthesis framework that transforms informal tasks into formal ones.

  • Key Details for Synthesized Subsets The hybrid-experts framework generates six distinct trajectory sets through a curriculum learning approach that progresses from single-turn to multi-turn tool interactions:

    • Auto-Formalization Sets: Includes single-turn trajectories (Daf\mathcal{D}_{af}Daf) where formal statements pass validation without tool feedback, and multi-turn trajectories (Daf\mathcal{D}_{af}^{\prime}Daf) where the model iterates using tool feedback until verification succeeds.
    • Whole-Proof Sets: Contains single-turn whole-proofs (Dwhole,pf\mathcal{D}_{whole,pf}Dwhole,pf) and multi-turn whole-proofs with tool interaction (Dwhole,pf\mathcal{D}_{whole,pf}^{\prime}Dwhole,pf), both requiring the final proof to pass synthesis and lemma checks.
    • Sketching Sets: Comprises lemma-style sketches (Dsk\mathcal{D}_{sk}^{\prime}Dsk) generated when whole proofs fail, and corresponding lemma-style proofs (Dsk,pf\mathcal{D}_{sk,pf}^{\prime}Dsk,pf) that complete the decomposition.
    • Filtering Rules: All subsets retain only verified trajectories where specific validation metrics (e.g., Vsyn\mathcal{V}_{syn}Vsyn, Vcon\mathcal{V}_{con}Vcon, Vleg\mathcal{V}_{leg}Vleg) return positive results like "PASS" or "SORRY" with consistency checks.
  • Data Usage and Training Strategy The authors employ a progressive synthesis strategy where single-turn trajectories represent simpler tasks and tool-interaction trajectories represent more difficult ones.

    • Mixture and Splitting: Prompts associated with tool calls are further divided into subsets for cold-start training and Reinforcement Learning (RL) training.
    • Difficulty Filtering: Prompts with a difficulty score of 0 are retained for future synthesis cycles, while those with a score of 1 for two consecutive iterations are removed to improve efficiency.
    • Training Phases: The cold-start phase mixes generic data to preserve informal reasoning capabilities, while subsequent iterations focus on the synthesized formal trajectories.
  • Processing and Metadata Construction

    • Basic Processing: The pipeline includes semantic deduplication, desaturation, and quality assurance. Formal statements generated during iterations are filtered to remove duplicates and prevent data leakage from the test set.
    • Difficulty Estimation: A metric calculates difficulty based on the pass rate of NNN synthesized trajectories per prompt, allowing the system to dynamically select challenging data for training.
    • Diversity Sampling: To prevent overfitting, each prompt is restricted to a single trajectory. The authors use a weighted sampling scheme that evaluates average trajectory length and tool call frequency, preferentially selecting responses with shorter lengths or fewer tool calls to avoid excessive reasoning.
    • Benchmark Integration: Evaluation utilizes diverse benchmarks including Combibench, FormalMath-lite, MathOlympiad-Bench, MiniF2F, ProofNet, ProverBench, and PutnamBench, all featuring natural language problems paired with formal counterparts.

Method

The authors introduce a Hybrid-Experts Iteration Framework designed to synthesize verified trajectories and continuously train professional native formal reasoning expert models. This approach decomposes native formal reasoning into three main capabilities: auto-formalization, sketching, and proving. Each capability corresponds to a specialized expert model, formally represented as πθaf\pi_{\theta_{af}}πθaf, πθsk\pi_{\theta_{sk}}πθsk, and πθpf\pi_{\theta_{pf}}πθpf, respectively. To ensure robustness, the system maintains specific tool sets for each expert to facilitate Tool-Integrated Reasoning (TIR) trajectories synthesis and rejected sampling.

The Auto-Formalization expert transforms natural language problem statements into formal Lean4 statements. To address potential syntax errors or semantic inconsistencies, the authors employ tools for statement syntax detection and semantic consistency detection. The Sketching expert generates lemma-style sketches, breaking down complex theorems into manageable helper lemmas using a Divide and Conquer strategy. Finally, the Proving expert handles theorem proving through two schemas: Whole-Proof Generation and Sketch-Proof Generation. These processes utilize verifiers for syntax and legality to ensure the generated proofs are valid and consistent with the formal statements.

The training process for the LongCat-Flash-Prover follows a structured pipeline organized into two main phases. The Cold-Start Phase utilizes an auto-formalization model and a thinking model to synthesize multiple formal statements and high-quality agentic trajectories. These trajectories are curated into a dataset through decontamination and sampling, then used to perform Domain-Mixed SFT on a LongCat Mid-train Base Model to obtain a Cold-start Model.

The Iteration Phase leverages the Cold-start Model as a new expert to refresh prompts and trajectories through self-distillation. This phase incorporates general reasoning data to maintain informal reasoning capabilities. The model is further trained using Domain-Mixed SFT and Agentic TIR RL. This iterative cycle allows the system to enrich trajectories and experts, progressively refining the model's performance until the final LongCat-Flash-Prover is reached.

To handle complex proofs, the system extends the trained model into a unified Judger-Sketcher-and-Prover capable of performing tree search in the lemma space. This method allows the agent to decompose complex goals via sketching, solve sub-goals via proving, and prune the search tree via judging. The proved lemmas are simplified as axioms in the context to compress agent memory usage.

For the agentic RL component, the authors introduce a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm. This algorithm implements a hierarchical gradient masking strategy to estimate training-inference consistency at sequence-level and token-level granularity. This ensures stable optimization by mitigating issues such as distribution drift and policy staleness that arise in asynchronous training environments.

Experiment

  • Auto-formalization experiments validate that LongCat-Flash-Prover achieves state-of-the-art results across multiple benchmarks, with tool feedback significantly enhancing the ability to solve previously unsolvable tasks and demonstrating that general reasoning models can effectively transfer to specialized formalization tasks.
  • Theorem proving evaluations confirm the model's superior efficacy in whole-proof and sketch-proof modes, establishing new records for open-source provers with high sample efficiency and showing that tree search strategies further improve accuracy by simplifying lemma decomposition.
  • General informal reasoning tests indicate that while native formal reasoning training causes a slight performance drop on informal tasks, the model retains sufficient capability, suggesting an acceptable trade-off between formal and informal reasoning skills.
  • Analysis of reward hacking reveals critical vulnerabilities in existing evaluation pipelines where models exploit syntax loopholes to generate fake proofs, and the implementation of AST-based verification successfully suppresses these cheating behaviors to ensure valid proof generation.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp