HyperAIHyperAI

Command Palette

Search for a command to run...

MathNet: 수학적 추론 및 검색을 위한 글로벌 멀티모달 벤치마크

Shaden Alshammari Kevin Wen Abrar Zainal Mark Hamilton Navid Safaei Sultan Albarakati William T. Freeman Antonio Torralba

초록

수학적 문제 해결은 대규모 언어 모델(Large Language Models) 및 멀티모달 모델(Multimodal Models)의 추론 능력을 시험하는 어려운 과제로 남아 있으나, 기존의 benchmark들은 규모, 언어 커버리지 및 작업의 다양성 측면에서 한계가 있습니다. 본 논문에서는 올림피아드 수준의 수학 문제로 구성된 고품질·대규모·멀티모달·다국어 데이터셋인 MATHNETMATHNETMATHNET을 소개합니다. 이와 더불어 generative model의 수학적 추론 능력과 embedding 기반 시스템의 수학적 검색(Mathematical Retrieval) 능력을 평가하기 위한 benchmark를 함께 제안합니다.MATHNETMATHNETMATHNET은 47개국, 17개 언어, 그리고 지난 20년간의 경시 대회를 아우르며, 다양한 분야에 걸쳐 전문가가 직접 작성한 해설을 포함한 30,676개의 문제를 포함하고 있습니다. 핵심 데이터셋 외에도, 인적 전문가가 선별한 수학적으로 동등하고 구조적으로 유사한 문제 쌍(problem pairs)으로 구성된 검색 benchmark를 구축하였습니다. MATHNETMATHNETMATHNET은 다음 세 가지 작업을 지원합니다: (i) Problem Solving, (ii) Math-Aware Retrieval, (iii) Retrieval-Augmented Problem Solving.실험 결과에 따르면, 최첨단(state-of-the-art) 추론 모델들(Gemini-3.1-Pro 78.4%, GPT-5 69.3%)조차 여전히 어려움을 겪고 있으며, embedding 모델들은 동등한 문제를 검색하는 데 난항을 겪는 것으로 나타났습니다. 나아가 RAG 성능은 검색 품질에 매우 민감하게 반응함을 확인하였습니다. 예를 들어, DeepSeek-V3.2-Speciale은 최대 12%의 성능 향상을 달성하며 본 benchmark에서 가장 높은 점수를 기록했습니다.

One-sentence Summary

Researchers introduce MathNet, a large-scale multimodal and multilingual benchmark comprising 30,676 expert-authored Olympiad-level math problems across 17 languages that evaluates mathematical reasoning in generative models, math-aware retrieval in embedding-based systems, and retrieval-augmented problem solving.

Key Contributions

  • The paper introduces MATHNET, a large-scale, multimodal, and multilingual dataset comprising 30,676 expert-authored Olympiad-level math problems and solutions spanning 47 countries and 17 languages.
  • This work establishes a specialized retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts to evaluate embedding-based systems.
  • Experimental results demonstrate the benchmark's utility across three tasks, showing that even state-of-the-art reasoning models struggle with the difficulty and that RAG performance is highly sensitive to retrieval quality.

Introduction

Mathematical problem solving serves as a fundamental benchmark for evaluating the reasoning capabilities of artificial intelligence. While existing datasets cover various domains, they often suffer from limited scale, a lack of multilingual and multimodal content, or insufficient expert-validated solutions for high-difficulty Olympiad level problems. Furthermore, prior retrieval systems frequently struggle to bridge the gap between symbolic equivalence and semantic similarity. The authors address these gaps by introducing MathNet, a large-scale multilingual and multimodal dataset featuring expert-validated problem pairs. By incorporating a fine-grained taxonomy of mathematical similarity, the authors enable more rigorous research into analogical reasoning, retrieval quality, and retrieval-augmented generation across different languages and modalities.

Dataset

Dataset Overview: MathNet

The authors introduce MathNet, a large scale multimodal and multilingual dataset designed to evaluate mathematical reasoning and retrieval. Unlike existing benchmarks that rely on community platforms, MathNet is built exclusively from official national competition booklets spanning 40 years (1985 to 2025).

Dataset Composition and Subsets

The corpus is divided into three specialized datasets:

  • MathNet-Solve: The core collection containing 30,676 expert authored Olympiad problems and solutions. It covers 143 competitions across 47 countries and 17 languages. The subset is organized into three splits:
    • MathNet-Solve-train: 23,776 samples.
    • MathNet-Solve-test: 6,400 samples.
    • MathNet-Solve-test-hard: 500 samples.
  • MathNet-Retrieve: A retrieval evaluation dataset consisting of 40,000 problems. It is constructed by taking 10,000 anchor problems from MathNet-Solve and generating one mathematically equivalent positive variant and three adversarial "hard negative" variants for each.
  • MathNet-RAG: A non synthetic evaluation set for retrieval augmented generation. It consists of 70 total problems: 35 anchors paired with 35 expert curated real problems drawn from MathNet-Solve.

Processing and Extraction Pipeline

The authors employed a sophisticated multi stage pipeline to convert heterogeneous PDF volumes into a uniform format:

  • Document Parsing: The team used the dots-ocr framework to convert 1,595 PDF volumes (over 25,000 pages) into Markdown, handling both digital typeset and scanned documents.
  • Problem-Solution Alignment: To manage inconsistent numbering and interleaved layouts, the authors used a three stage LLM based pipeline. Gemini-2.5-Flash was used for document segmentation, and GPT-4.1 was used to extract problems and solutions into LaTeX friendly formats.
  • Verification: Extracted pairs underwent a rigorous three tier validation process involving a rule based analytical checker, GPT-4.1 visual inspection of source screenshots, and manual human review. A pair was only retained if all three mechanisms reached a unanimous agreement.
  • Metadata Construction: The pipeline captures detailed provenance for each entry, including country, competition name, year, and author notes.

Usage and Evaluation Tasks

The authors use the data to benchmark models across three distinct mathematical tasks:

  1. Problem Solving: Evaluating generative models by comparing their generated solutions against the expert reference solutions in MathNet-Solve.
  2. Math-Aware Retrieval: Testing embedding based systems on their ability to identify mathematically equivalent problems in MathNet-Retrieve, moving beyond simple lexical overlap.
  3. Retrieval-Augmented Problem Solving (RAG): Assessing how retrieval quality impacts reasoning performance using the expert paired problems in MathNet-RAG.

Method

The authors propose a comprehensive pipeline designed to transform raw mathematical documents into high-quality, structured datasets suitable for training and evaluating large language models. The overall workflow begins with the ingestion of diverse source materials, which include over 30,000 Olympiad-level problems spanning more than 40 countries and four decades of mathematical history.

As shown in the figure below:

The system processes input booklets through a multi-stage extraction and segmentation pipeline. Initially, PDF files undergo text extraction using DotsOCR, while page screenshots are captured to preserve visual information. These components are then converted into Markdown format. The core of the processing engine involves document segmentation, where the text is partitioned into discrete problem and solution blocks. This segmentation is supported by a specialized module that performs structural parsing, boundary detection, and problem-solution alignment to ensure that each mathematical challenge is correctly paired with its corresponding rigorous proof or answer.

Following segmentation, the system employs GPT-4.1 for format normalization, ensuring that the extracted content adheres to a consistent structure. This is followed by a semantic metadata extraction phase, which categorizes problems by topic, difficulty, and type. To maintain the highest possible data integrity, the pipeline incorporates a human verification step. The final stage involves a rule-based source consistency check and a cross-verification and deduplication process, resulting in a curated dataset of high-quality, human-verified mathematical problem-solution pairs.

The resulting dataset supports three primary evaluation tasks: math comprehension, problem retrieval, and Math RAG (Retrieval-Augmented Generation). This structured approach allows for the rigorous benchmarking of LLMs, MLLMs, and embedding models against complex mathematical reasoning tasks.

Experiment

The MATHNET evaluation assesses model capabilities across three distinct tasks: direct problem solving, math-aware retrieval of equivalent problems, and retrieval-augmented problem solving. While frontier reasoning models demonstrate impressive performance in solving complex mathematical problems, current embedding models struggle significantly with math-aware retrieval due to a reliance on superficial lexical overlap rather than deep structural understanding. Consequently, retrieval-augmented generation only provides consistent improvements when the retrieved context is mathematically and structurally aligned with the target problem.

The authors evaluate various embedding models on the MathNet-Retrieve benchmark to assess their ability to perform math-aware retrieval across different mathematical domains. Results show that while retrieval accuracy is generally low at the top-1 level, performance improves significantly as the number of retrieved candidates increases. Gemini-embedding-001 achieves the highest overall performance for both Recall@1 and Recall@5 across all domains. Retrieval accuracy is consistently higher at the Recall@5 level compared to the Recall@1 level across all tested models and subjects. Legacy text embedding models generally demonstrate lower retrieval performance compared to newer specialized or larger embedding models.

The authors evaluate the impact of retrieval-augmented generation on mathematical problem solving using both human and LLM grading. Results show that providing expert-curated, structurally similar problems as context generally improves performance compared to zero-shot settings. Expert-RAG consistently yields the highest accuracy under human grading across most evaluated models Retrieval-based augmentation provides significant performance gains for mid-tier models compared to zero-shot baselines LLM grading results broadly align with human expert assessments regarding model performance trends

The authors evaluate the impact of retrieval-augmented generation on mathematical problem solving using human and LLM grading. Results show that providing ground-truth related problems through expert-RAG consistently improves performance compared to zero-shot settings. Expert-RAG provides the most significant and consistent performance gains across most models Embedding-based retrieval results show high variance and can sometimes lead to performance degradation The performance boost from retrieval is particularly notable for lower and mid-tier solvers

The authors evaluate the problem solving accuracy of various models on the MathNet-Solve-Test dataset across multiple languages. Results show that frontier reasoning models achieve the highest overall performance, while performance varies significantly depending on the specific language used. Frontier models like gemini-3.1-pro demonstrate the highest overall accuracy compared to other tested systems Model performance is notably higher on certain languages such as Italian and Portuguese compared to others like Chinese There is a substantial performance gap between state of the art reasoning models and smaller or earlier generation models

The authors compare MathNet against several existing mathematical reasoning benchmarks across various dimensions including size, language coverage, and difficulty. MathNet distinguishes itself by offering a larger scale and broader multilingual support compared to the surveyed datasets. MathNet provides significantly larger scale and more extensive language support than existing benchmarks While most existing benchmarks focus on grade school or high school levels, MathNet focuses on Olympiad level difficulty MathNet utilizes expression and proof based evaluation types to assess mathematical reasoning

The authors evaluate the MathNet benchmark by assessing embedding model retrieval capabilities, the effectiveness of retrieval-augmented generation (RAG) on problem-solving accuracy, and the performance of various reasoning models across multiple languages. The results demonstrate that specialized embedding models and expert-curated RAG significantly enhance mathematical reasoning, particularly for mid-tier models, while frontier reasoning models maintain a substantial performance lead across diverse linguistic contexts. Ultimately, MathNet distinguishes itself from existing benchmarks by providing a larger scale, broader multilingual support, and higher difficulty levels focused on Olympiad-level mathematical reasoning.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
MathNet: 수학적 추론 및 검색을 위한 글로벌 멀티모달 벤치마크 | 문서 | HyperAI초신경