منذ 10 ساعات

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

جدول المحتويات

الملخص

نقدم MaxProof، وهو إطار عمل للتوسع أثناء مرحلة الاختبار على مستوى التجمعات، مخصص لإثباتات الرياضيات على مستوى المسابقات ضمن سلسلة MiniMax-M3. يقوم M3 أولاً بتدريب ثلاث قدرات موجهة نحو الإثباتات، وهي: توليد الإثباتات، والتحقق منها، وإصلاحها استناداً إلى النقد، وذلك باستخدام مدقق توليدي يعتمد استراتيجية الدفاع متعدد الطبقات، ومُصمم لتحقيق معدل منخفض من الإيجابيات الكاذبة. وقد تم دمج هذه القدرات في نموذج M3 واحد مُتاح للنشر. أثناء الاختبار، يعامل MaxProof النموذج كمولّد، ومدقّق، ومُحسّن، ومُرتّب، حيث يبحث في تجمّع من الإثباتات المرشحة، ثم يُرجع إثباتاً نهائياً واحداً عبر آلية اختيار البطولة. ومن خلال التوسع أثناء الاختبار باستخدام MaxProof، يصل نموذج M3 إلى نتيجة 35/42 في IMO 2025 و36/42 في USAMO 2026، متجاوزاً بذلك عتبة الميدالية الذهبية البشرية في كل من المسابقتين.

One-sentence Summary

MaxProof is a population-level test-time scaling framework for the MiniMax-M3 series that integrates proof generation, verification, and critique-conditioned proof repair through a low false-positive generative verifier, applies tournament selection over candidate proofs to produce a final solution, and achieves 35 out of 42 on IMO 2025 and 36 out of 42 on USAMO 2026, exceeding the human gold-medal threshold on both competitions.

Key Contributions

The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Mathematical proof serves as a rigorous stress test for language model reasoning, demanding precise logical chains with minimal tolerance for error, yet prior systems often struggle with verifier noise, reward hacking, and the absence of executable oracles. The authors develop M3, a model trained through a defense-in-depth generative verifier to integrate proof generation, verification, and critique-conditioned repair while suppressing false positives that typically undermine reinforcement learning stability. They further leverage MaxProof, a population-level test-time scaling framework that orchestrates the model as a generator, verifier, refiner, and ranker to search candidate proofs and finalize selections via tournament competition, achieving gold-medal performance on IMO and USAMO benchmarks.

Dataset

1. Dataset Composition and Sources

The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

The evaluation utilizes a two-stage design, first assessing the standalone M3 model on mathematical proof benchmarks to measure baseline reasoning capabilities, and then applying the MaxProof framework on competition problems to validate how population-level search and multi-layer verification transform raw ability into consistent performance. Earlier single-judge training methods proved vulnerable to systematic reward-hacking behaviors like length inflation and rigid templating, which directly informed the creation of a coordinated defense pipeline. Ultimately, the experiments demonstrate that MaxProof effectively bridges the gap between base model potential and reliable contest output through iterative refinement and expert-aligned scoring, while also highlighting that performance ceilings remain tied to underlying model reasoning and occasional ranking inconsistencies.

ملف PDF المصدر

جدول المحتويات

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي

وحدات GPU جاهزة للعمل

أفضل الأسعار

ابدأ عرض الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا

سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين

مدعوم بواسطة MailChimp

منذ 10 ساعات

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

جدول المحتويات

الملخص

One-sentence Summary

Key Contributions

The MiniMax-M3 model integrates proof generation, verification, and critique-conditioned proof repair into a single architecture trained with a defense-in-depth generative verifier engineered for low false-positive rate.
MaxProof operates as a population-level test-time scaling framework that treats the model as a generator, verifier, refiner, and ranker to search candidate proofs and select final outputs via tournament selection.
Applying MaxProof to the released model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing human gold-medal thresholds while reaching 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.

Introduction

Dataset

1. Dataset Composition and Sources

The authors compile core training data from public competition mathematics sources, focusing on problem statements paired with human-written reference solutions.
Additional supervision data is harvested directly from the Proof Expert’s reinforcement learning rollouts, requiring no manual annotation.

2. Key Details for Each Subset

Core Math Problems: Each entry is annotated with its mathematical domain, the specific solving technique required, and a reference solution. A grading scheme is auto-generated via few-shot prompting against human exemplars.
Verifier Expert Subset: Extracted as (problem, candidate, analysis, errors, verdict) tuples. The raw distribution heavily favors the no_errors and has_errors classes, so the authors rebalance all four verdict categories to prevent model collapse.
Fixer Expert Subset: Drawn from the same rollout pool but strictly filtered to retain only flawed proofs marked as minor_gaps, has_errors, or fundamentallyWrong.

3. Data Processing and Training Usage

Before entering the RL loop, the authors apply three preprocessing filters: difficulty filtering drops problems the baseline M2.7 model solves reliably, domain balancing equalizes representation across algebra, combinatorics, geometry, and number theory, and trick-frequency control smooths common techniques while preserving rare competition patterns.
Held-out evaluation benchmarks like IMO 2025 and USAMO 2026 are explicitly excluded, and near-duplicate filtering is applied against them.
The dataset is partitioned by prompt rather than by candidate to prevent training-validation leakage.
The final mixture is intentionally non-uniform, designed to maintain useful structural signals while ensuring no single domain or technique dominates the gradient updates.

4. Metadata Construction and Structural Processing

The pipeline constructs a standardized schema where each rollout yields aligned textual critiques and scalar rewards tied to the same failure mode.
Data alignment targets the pessimistic-min teacher signal, ensuring the verifier and fixer experts learn the exact correctness criteria used during proof generation.
Rollout outputs are parsed into fixed structural templates, capturing problem statements, candidate solutions, step-by-step analyses, and verdicts for downstream expert training.

Experiment

ملف PDF المصدر

جدول المحتويات

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

البرمجة التعاونية باستخدام الذكاء الاصطناعي

وحدات GPU جاهزة للعمل

أفضل الأسعار

ابدأ عرض الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا

سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين

مدعوم بواسطة MailChimp

Command Palette

MaxProof: توسيع نطاق الإثبات الرياضي باستخدام التعلّم التعزيزي المولّد-المُحقّق والتوسع أثناء وقت الاختبار على مستوى المجموعة

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

الملخص

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

HyperAI Newsletters

Command Palette

MaxProof: توسيع نطاق الإثبات الرياضي باستخدام التعلّم التعزيزي المولّد-المُحقّق والتوسع أثناء وقت الاختبار على مستوى المجموعة

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

الملخص

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

HyperAI Newsletters

Command Palette

MaxProof: توسيع نطاق الإثبات الرياضي باستخدام التعلّم التعزيزي المولّد-المُحقّق والتوسع أثناء وقت الاختبار على مستوى المجموعة

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li13 more

الملخص

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

بناء الذكاء الاصطناعي بالذكاء الاصطناعي

HyperAI Newsletters

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li

Jiacheng Chen Xinyu Zhang Shunkai Zhang Yanmohan Wang Lin Li Tiancheng Qin Qin Wang Zhengmao Zhu Tianle Li Jingyang Li