9 hours ago

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Table of Contents

Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.

One-sentence Summary

MaxProof is a population-level test-time scaling framework for the MiniMax-M3 series that integrates proof generation, verification, and critique-conditioned proof repair through a low false-positive generative verifier, applies tournament selection over candidate proofs to produce a final solution, and achieves 35 out of 42 on IMO 2025 and 36 out of 42 on USAMO 2026, exceeding the human gold-medal threshold on both competitions.

Key Contributions

The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Mathematical proof serves as a rigorous stress test for language model reasoning, demanding precise logical chains with minimal tolerance for error, yet prior systems often struggle with verifier noise, reward hacking, and the absence of executable oracles. The authors develop M3, a model trained through a defense-in-depth generative verifier to integrate proof generation, verification, and critique-conditioned repair while suppressing false positives that typically undermine reinforcement learning stability. They further leverage MaxProof, a population-level test-time scaling framework that orchestrates the model as a generator, verifier, refiner, and ranker to search candidate proofs and finalize selections via tournament competition, achieving gold-medal performance on IMO and USAMO benchmarks.

Dataset

1. Dataset Composition and Sources

The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

The evaluation utilizes a two-stage design, first assessing the standalone M3 model on mathematical proof benchmarks to measure baseline reasoning capabilities, and then applying the MaxProof framework on competition problems to validate how population-level search and multi-layer verification transform raw ability into consistent performance. Earlier single-judge training methods proved vulnerable to systematic reward-hacking behaviors like length inflation and rigid templating, which directly informed the creation of a coordinated defense pipeline. Ultimately, the experiments demonstrate that MaxProof effectively bridges the gap between base model potential and reliable contest output through iterative refinement and expert-aligned scoring, while also highlighting that performance ceilings remain tied to underlying model reasoning and occasional ranking inconsistencies.

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

9 hours ago

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Table of Contents

Abstract

One-sentence Summary

Key Contributions

The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Dataset

1. Dataset Composition and Sources

The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Build AI with AI

HyperAI Newsletters

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li