il y a 10 heures

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Table des matières

Résumé

Nous présentons MaxProof, un cadre de mise à l'échelle lors de l'inférence à l'échelle de la population pour la démonstration mathématique de niveau compétition au sein de la série MiniMax-M3. M3 entraîne d'abord trois capacités orientées preuve -- la génération de preuves, la vérification de preuves et la correction de preuves conditionnée par la critique -- à l'aide d'un vérificateur génératif à défense en profondeur conçu pour un faible taux de faux positifs. Ces capacités sont fusionnées en un seul modèle M3 publié. Lors de l'inférence, MaxProof traite le modèle comme un générateur, un vérificateur, un raffineur et un classeur, effectue une recherche au sein d'une population de preuves candidates et retourne une preuve finale unique par sélection par tournoi. Grâce à la mise à l'échelle lors de l'inférence de MaxProof, le modèle M3 atteint 35/42 à l'OIM 2025 et 36/42 à l'USAMO 2026, dépassant le seuil de médaille d'or humain dans les deux compétitions.

One-sentence Summary

MaxProof is a population-level test-time scaling framework for the MiniMax-M3 series that integrates proof generation, verification, and critique-conditioned proof repair through a low false-positive generative verifier, applies tournament selection over candidate proofs to produce a final solution, and achieves 35 out of 42 on IMO 2025 and 36 out of 42 on USAMO 2026, exceeding the human gold-medal threshold on both competitions.

Key Contributions

The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Mathematical proof serves as a rigorous stress test for language model reasoning, demanding precise logical chains with minimal tolerance for error, yet prior systems often struggle with verifier noise, reward hacking, and the absence of executable oracles. The authors develop M3, a model trained through a defense-in-depth generative verifier to integrate proof generation, verification, and critique-conditioned repair while suppressing false positives that typically undermine reinforcement learning stability. They further leverage MaxProof, a population-level test-time scaling framework that orchestrates the model as a generator, verifier, refiner, and ranker to search candidate proofs and finalize selections via tournament competition, achieving gold-medal performance on IMO and USAMO benchmarks.

Dataset

1. Dataset Composition and Sources

The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

The evaluation utilizes a two-stage design, first assessing the standalone M3 model on mathematical proof benchmarks to measure baseline reasoning capabilities, and then applying the MaxProof framework on competition problems to validate how population-level search and multi-layer verification transform raw ability into consistent performance. Earlier single-judge training methods proved vulnerable to systematic reward-hacking behaviors like length inflation and rigid templating, which directly informed the creation of a coordinated defense pipeline. Ultimately, the experiments demonstrate that MaxProof effectively bridges the gap between base model potential and reliable contest output through iterative refinement and expert-aligned scoring, while also highlighting that performance ceilings remain tied to underlying model reasoning and occasional ranking inconsistencies.

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 10 heures

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Table des matières

Résumé

One-sentence Summary

Key Contributions

The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Dataset

1. Dataset Composition and Sources

The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

PDF source

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

MaxProof : Mise à l'échelle de la preuve mathématique avec un RL générateur-vérificateur et une mise à l'échelle au moment de l'inférence à l'échelle de la population

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

MaxProof : Mise à l'échelle de la preuve mathématique avec un RL générateur-vérificateur et une mise à l'échelle au moment de l'inférence à l'échelle de la population

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

MaxProof : Mise à l'échelle de la preuve mathématique avec un RL générateur-vérificateur et une mise à l'échelle au moment de l'inférence à l'échelle de la population

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li