Command Palette
Search for a command to run...
MaxProof : Mise à l'échelle de la preuve mathématique avec un RL générateur-vérificateur et une mise à l'échelle au moment de l'inférence à l'échelle de la population
MaxProof : Mise à l'échelle de la preuve mathématique avec un RL générateur-vérificateur et une mise à l'échelle au moment de l'inférence à l'échelle de la population
Résumé
Nous présentons MaxProof, un cadre de mise à l'échelle lors de l'inférence à l'échelle de la population pour la démonstration mathématique de niveau compétition au sein de la série MiniMax-M3. M3 entraîne d'abord trois capacités orientées preuve -- la génération de preuves, la vérification de preuves et la correction de preuves conditionnée par la critique -- à l'aide d'un vérificateur génératif à défense en profondeur conçu pour un faible taux de faux positifs. Ces capacités sont fusionnées en un seul modèle M3 publié. Lors de l'inférence, MaxProof traite le modèle comme un générateur, un vérificateur, un raffineur et un classeur, effectue une recherche au sein d'une population de preuves candidates et retourne une preuve finale unique par sélection par tournoi. Grâce à la mise à l'échelle lors de l'inférence de MaxProof, le modèle M3 atteint 35/42 à l'OIM 2025 et 36/42 à l'USAMO 2026, dépassant le seuil de médaille d'or humain dans les deux compétitions.
One-sentence Summary
MaxProof is a population-level test-time scaling framework for the MiniMax-M3 series that integrates proof generation, verification, and critique-conditioned proof repair through a low false-positive generative verifier, applies tournament selection over candidate proofs to produce a final solution, and achieves 35 out of 42 on IMO 2025 and 36 out of 42 on USAMO 2026, exceeding the human gold-medal threshold on both competitions.
Key Contributions
- The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
- MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
- Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.
Introduction
Mathematical proof serves as a rigorous stress test for language model reasoning, demanding precise logical chains with minimal tolerance for error, yet prior systems often struggle with verifier noise, reward hacking, and the absence of executable oracles. The authors develop M3, a model trained through a defense-in-depth generative verifier to integrate proof generation, verification, and critique-conditioned repair while suppressing false positives that typically undermine reinforcement learning stability. They further leverage MaxProof, a population-level test-time scaling framework that orchestrates the model as a generator, verifier, refiner, and ranker to search candidate proofs and finalize selections via tournament competition, achieving gold-medal performance on IMO and USAMO benchmarks.
Dataset
1. Dataset Composition and Sources
- The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
- Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.
2. Key Details for Each Subset
- Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
- Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the
no_errorsandhas_errorsclasses, so the authors rebalance all four verdict categories to prevent model collapse. - Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as
minor_gaps,has_errors, orfundamentallyWrong.
3. Data Processing and Training Usage
- Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
- Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
- The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
- The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.
4. Metadata Construction and Structural Processing
- The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
- Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
- Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.
Experiment
The evaluation utilizes a two-stage design, first assessing the standalone M3 model on mathematical proof benchmarks to measure baseline reasoning capabilities, and then applying the MaxProof framework on competition problems to validate how population-level search and multi-layer verification transform raw ability into consistent performance. Earlier single-judge training methods proved vulnerable to systematic reward-hacking behaviors like length inflation and rigid templating, which directly informed the creation of a coordinated defense pipeline. Ultimately, the experiments demonstrate that MaxProof effectively bridges the gap between base model potential and reliable contest output through iterative refinement and expert-aligned scoring, while also highlighting that performance ceilings remain tied to underlying model reasoning and occasional ranking inconsistencies.