ProcessBench Mathematical Reasoning Benchmark Dataset
ProcessBench is a benchmark dataset for identifying errors in mathematical reasoning: it measures how well language models can locate incorrect steps in step-by-step math solutions. It was released by the Qwen team at Alibaba Group in 2024, and the accompanying paper is "ProcessBench: Identifying Process Errors in Mathematical Reasoning".
The dataset contains 3.4k test examples built from competition- and Olympiad-level math problems. Each example includes a step-by-step solution whose erroneous steps have been carefully annotated by domain experts. To build the dataset, the research team collected problems from several public data sources, generated solutions with a variety of open-source language models, and had experts review the results to ensure high data quality.
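To make the task concrete, the following is a minimal Python sketch of how an example might be represented and how a model's predictions could be scored against the expert annotations. It assumes a hypothetical record with a `label` field holding the index of the earliest erroneous step (-1 when every step is correct), and it assumes an aggregate score equal to the harmonic mean of accuracy on erroneous examples and accuracy on error-free examples. The names `Example`, `is_correct`, and `score` are illustrative and are not the dataset's official schema or evaluation script.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    # Hypothetical record layout; field names are assumptions for illustration.
    problem: str
    steps: List[str]  # the step-by-step solution, one string per step
    label: int        # assumed: index of the earliest erroneous step, or -1 if all steps are correct


def is_correct(pred: int, ex: Example) -> bool:
    """A prediction counts only if it matches the annotated step index exactly."""
    return pred == ex.label


def score(preds: List[int], examples: List[Example]) -> dict:
    """Accuracy on erroneous and error-free examples separately, plus their harmonic mean."""
    err = [is_correct(p, e) for p, e in zip(preds, examples) if e.label != -1]
    ok = [is_correct(p, e) for p, e in zip(preds, examples) if e.label == -1]
    acc_err = sum(err) / len(err) if err else 0.0
    acc_ok = sum(ok) / len(ok) if ok else 0.0
    # Harmonic mean rewards models that do well on both subsets, not just one.
    total = acc_err + acc_ok
    f1 = 2 * acc_err * acc_ok / total if total else 0.0
    return {"acc_erroneous": acc_err, "acc_correct": acc_ok, "f1": f1}


if __name__ == "__main__":
    examples = [
        Example("1 + 1 = ?", ["1 + 1 = 3", "The answer is 3."], 0),
        Example("2 * 2 = ?", ["2 * 2 = 4", "The answer is 4."], -1),
        Example("3 + 4 = ?", ["3 + 4 = 7", "7 * 1 = 7", "7 - 1 = 7"], 2),
        Example("5 - 2 = ?", ["5 - 2 = 3"], -1),
    ]
    print(score([0, -1, 1, -1], examples))
```

In this toy run the model finds one of the two erroneous steps and accepts both correct solutions, so it scores 0.5 on erroneous examples, 1.0 on error-free ones, and about 0.667 overall. Averaging the two subsets this way keeps a model from scoring well simply by always answering "no error" or always flagging a step.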
