
ReasonFlux-PRM: New Model Evaluates Both Intermediate Steps and Final Answers to Improve Reasoning in LLMs


Scale AI, a leading data-labeling company, has confirmed a significant investment from Meta that values the startup at $29 billion. Co-founder and CEO Alexandr Wang is stepping down to join Meta, where he will focus on the company's superintelligence initiatives, and Jason Droege, Scale's current Chief Strategy Officer, will take over as interim CEO. Despite the investment, Scale AI will remain independent, and Wang will stay on its board of directors. The deal underscores Meta's push to strengthen its AI capabilities as it competes with companies like Google, OpenAI, and Anthropic: Meta's roughly $14.3 billion share purchase secures a 49% stake in Scale AI, reflecting the growing importance of high-quality training data for advanced AI models. Scale AI has been a crucial player in the AI ecosystem, providing structured data for training large language models (LLMs), and in recent months it has expanded its team with PhD researchers and senior software engineers to improve data quality and support the development of next-generation AI systems.

Introducing ReasonFlux-PRM

Researchers from the University of Illinois Urbana-Champaign (UIUC), Princeton University, Cornell University, and ByteDance Seed have developed a new Process Reward Model (PRM) called ReasonFlux-PRM. The model evaluates both the intermediate reasoning steps and the final answers produced by LLMs, addressing a critical gap in current PRM methodologies.

Traditional PRMs and Their Limitations

Most existing PRMs score only the final output and ignore the reasoning process that leads to it. This is insufficient for advanced reasoning models such as DeepSeek-R1, which generate long reasoning chains before producing a final response. These reasoning trajectories are often reused to train smaller models, but without trajectory-level evaluation the supervision is unreliable and can degrade performance. Even high-capacity PRMs such as Qwen2.5-Math-PRM-72B struggle to distinguish high-quality from low-quality intermediate reasoning: when applied to trajectory-response data from models like Gemini and DeepSeek-R1, they often produce overlapping reward scores, indicating poor sensitivity and weak data selection for downstream fine-tuning. Experiments have shown that models trained on data selected by such PRMs perform worse than models trained on human-curated datasets.

Technical Features of ReasonFlux-PRM

ReasonFlux-PRM scores each intermediate step within a reasoning trajectory according to its contribution to the final answer. It uses a reference reward function that conditions on the initial prompt, the preceding reasoning steps, and the final output to assign step-level scores, and these scores are aggregated into an overall trajectory reward. Key features and applications of ReasonFlux-PRM include (see the sketch after this list):

- Offline Filtering: selecting high-quality training data by evaluating intermediate steps.
- Dense Reward Provision: providing step-level feedback during reinforcement learning to optimize model performance.
- Best-of-N Test-Time Response Selection: improving inference quality by selecting the best response from multiple candidates.
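To make the scoring scheme concrete, here is a minimal Python sketch of how a trajectory-aware PRM can be used for trajectory scoring, offline data filtering, and Best-of-N selection. The data structures, function names (trajectory_reward, filter_for_sft, best_of_n), and the simple mean aggregation are illustrative assumptions for exposition, not the actual ReasonFlux-PRM implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    steps: List[str]       # intermediate reasoning steps
    final_answer: str


# Hypothetical step scorer: (prompt, previous_steps, current_step, final_answer) -> score.
# In ReasonFlux-PRM this role is played by the learned reward model; here it is a
# pluggable callable so the sketch stays self-contained.
StepScorer = Callable[[str, List[str], str, str], float]


def trajectory_reward(traj: Trajectory, score_step: StepScorer) -> float:
    """Aggregate step-level scores into one trajectory-level reward.

    Mean aggregation is an illustrative assumption; the paper's exact
    aggregation scheme may differ.
    """
    scores = [
        score_step(traj.prompt, traj.steps[:i], step, traj.final_answer)
        for i, step in enumerate(traj.steps)
    ]
    return sum(scores) / max(len(scores), 1)


def filter_for_sft(trajectories: List[Trajectory], score_step: StepScorer,
                   threshold: float) -> List[Trajectory]:
    """Offline filtering: keep trajectories whose reward clears a quality threshold."""
    return [t for t in trajectories if trajectory_reward(t, score_step) >= threshold]


def best_of_n(candidates: List[Trajectory], score_step: StepScorer) -> Trajectory:
    """Best-of-N test-time selection: return the highest-reward candidate."""
    return max(candidates, key=lambda t: trajectory_reward(t, score_step))
```

For reinforcement learning, the same per-step scores can be emitted individually as dense rewards rather than aggregated, which corresponds to the dense-reward use case listed above.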
Empirical Results

Evaluations of ReasonFlux-PRM on benchmarks including AIME, MATH500, and GPQA-Diamond showed consistent gains over traditional PRMs. Specifically:

- Supervised Fine-Tuning: a 12.1% accuracy gain when training models on ReasonFlux-PRM-selected data.
- Reinforcement Learning: a 4.5% improvement in performance when using its dense rewards.
- Test-Time Scaling: a 6.3% increase in accuracy for test-time response selection.

These results are especially notable given that ReasonFlux-PRM is a much smaller model (7B parameters) than the high-capacity Qwen2.5-Math-PRM-72B. For example, the Qwen2.5-14B-Instruct model trained on data selected by ReasonFlux-PRM reached performance comparable to or better than models trained on human-curated datasets, whereas other PRMs caused significant drops of up to 26.6% on certain benchmarks.

Impact and Future Directions

ReasonFlux-PRM represents a significant advance in the evaluation and supervision of reasoning models. By assessing both the reasoning process and the final output, it yields more reliable, higher-quality training data and, ultimately, better-performing AI systems. This trajectory-aware approach sets a new standard for systematically evaluating and improving reasoning in LLMs and could reshape how such models are trained and refined. Industry insiders and researchers believe ReasonFlux-PRM could help close the performance gap between current AI models and human-curated standards, and its flexible, comprehensive framework may inspire similar models, further advancing the field. Scale AI, known for its expertise in data labeling, stands to benefit from broader adoption of trajectory-aware evaluation methods, reinforcing its position as a key player in the AI industry.
