HyperAIHyperAI

Command Palette

Search for a command to run...

MaxProof: 生成検証型強化学習および集団レベルのテスト時スケーリングを用いた数学的証明のスケーリング

概要

本稿では、MiniMax-M3シリーズにおける競技レベルの数学的証明を対象とした、集団レベルのテスト時スケーリングフレームワークであるMaxProofを発表する。M3はまず、低偽陽性率を目的として設計された深度防御型生成検証器を用い、証明生成、証明検証、批判条件付き証明修復という3つの証明関連能力を学習する。これらの能力は、単一の公開モデルであるM3に統合されている。テスト時には、MaxProofがモデルを生成器、検証器、精緻化器、および順位付け器として扱い、候補証明の集団に対して探索を実行し、トーナメント選択により1つの最終的な証明を出力する。MaxProofによるテスト時スケーリングを適用することで、M3モデルはIMO 2025において35/42、USAMO 2026において36/42の成績を達成し、両大会ともに人間の金メダル獲得基準値を上回った。

One-sentence Summary

MaxProof is a population-level test-time scaling framework for the MiniMax-M3 series that integrates proof generation, verification, and critique-conditioned proof repair through a low false-positive generative verifier, applies tournament selection over candidate proofs to produce a final solution, and achieves 35 out of 42 on IMO 2025 and 36 out of 42 on USAMO 2026, exceeding the human gold-medal threshold on both competitions.

Key Contributions

  • The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
  • MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
  • Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Mathematical proof serves as a rigorous stress test for language model reasoning, demanding precise logical chains with minimal tolerance for error, yet prior systems often struggle with verifier noise, reward hacking, and the absence of executable oracles. The authors develop M3, a model trained through a defense-in-depth generative verifier to integrate proof generation, verification, and critique-conditioned repair while suppressing false positives that typically undermine reinforcement learning stability. They further leverage MaxProof, a population-level test-time scaling framework that orchestrates the model as a generator, verifier, refiner, and ranker to search candidate proofs and finalize selections via tournament competition, achieving gold-medal performance on IMO and USAMO benchmarks.

Dataset

1. Dataset Composition and Sources

  • The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
  • Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

  • Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
  • Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
  • Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

  • Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
  • Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
  • The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
  • The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

  • The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
  • Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
  • Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

The evaluation utilizes a two-stage design, first assessing the standalone M3 model on mathematical proof benchmarks to measure baseline reasoning capabilities, and then applying the MaxProof framework on competition problems to validate how population-level search and multi-layer verification transform raw ability into consistent performance. Earlier single-judge training methods proved vulnerable to systematic reward-hacking behaviors like length inflation and rigid templating, which directly informed the creation of a coordinated defense pipeline. Ultimately, the experiments demonstrate that MaxProof effectively bridges the gap between base model potential and reliable contest output through iterative refinement and expert-aligned scoring, while also highlighting that performance ceilings remain tied to underlying model reasoning and occasional ranking inconsistencies.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています