Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
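
As a rough sketch of the idea (the notation below is assumed for illustration and is not taken from the abstract), a step-level variant of the DPO objective would condition on the prompt $x$ together with the preceding correct steps $s_{1 \sim k-1}$, and contrast a preferred next step $s_{\text{win}}$ against a dispreferred one $s_{\text{lose}}$ under the policy $\pi_\theta$ and a frozen reference $\pi_{\text{ref}}$:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, s_{1 \sim k-1},\, s_{\text{win}},\, s_{\text{lose}}) \sim D}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(s_{\text{win}} \mid x, s_{1 \sim k-1})}{\pi_{\text{ref}}(s_{\text{win}} \mid x, s_{1 \sim k-1})}
- \beta \log \frac{\pi_\theta(s_{\text{lose}} \mid x, s_{1 \sim k-1})}{\pi_{\text{ref}}(s_{\text{lose}} \mid x, s_{1 \sim k-1})}
\right) \right]
$$

Compared with standard DPO, which contrasts full answers, the preference signal here localizes to the first step at which a reasoning chain goes wrong, which is the fine-grained process supervision the abstract argues is missing.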