
Reinforcement Learning Fails to Significantly Enhance Large Language Models' Reasoning Abilities


Recently, reinforcement learning with verifiable rewards (RLVR) has shown significant success in enhancing the reasoning capabilities of large language models (LLMs), particularly on mathematics and programming tasks. The prevailing belief is that RLVR lets LLMs continuously improve and acquire reasoning abilities that surpass those of their base models. This study critically re-evaluates that assumption by measuring pass@k at much larger values of k, probing the boundary of reasoning capacity across several model families and benchmarks.

Surprisingly, the researchers found that RL does not elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at small k (such as k=1), they merely match or even fall behind the base models once k grows large. This indicates that the reasoning paths produced by RL-trained models already lie within the base models' sampling distribution: most of the reasoning ability attributed to RL training was present in the original models all along.

Further analysis showed that RL improves performance by reweighting the model's output distribution toward paths that are more likely to earn rewards, which makes sampling a correct answer more efficient. That same bias, however, narrows the range of reasoning paths the model explores relative to its base model. The sketches below illustrate the pass@k metric used in the study and this sharpening effect.

Similar results were observed in visual reasoning tasks trained with RLVR, reinforcing the conclusion that RL's impact on reasoning is a matter of efficiency rather than an expansion of capability. By contrast, the study notes that knowledge distillation, unlike RLVR, can genuinely introduce new knowledge into a model.

These findings highlight a key limitation of RLVR in advancing the reasoning abilities of LLMs, prompting a reexamination of RL training's role and the exploration of alternative paradigms that might better foster new reasoning skills. The project page, with detailed methodology and findings, is available at https://limit-of-RLVR.github.io. The implication for LLM development is that while RLVR is effective at optimizing existing reasoning pathways, it may not be the right tool for expanding what these models can reason about; researchers and developers should consider combining RLVR with other techniques, such as knowledge distillation, to build models capable of both efficient and novel reasoning.
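For readers unfamiliar with the study's central metric, the sketch below shows the standard unbiased pass@k estimator (introduced with HumanEval by Chen et al., 2021), which estimates, from n samples per problem of which c were verified correct, the probability that at least one of k attempts succeeds. This is a minimal illustration of the metric, not code from the paper; the function and variable names are our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples drawn for a problem
    c: how many of those samples were verified correct
    k: the attempt budget being scored (k <= n)

    Returns the probability that at least one of k samples,
    drawn without replacement from the n, is correct.
    """
    if n - c < k:
        # Every size-k subset of the n samples must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: out of 256 samples, only 3 were correct.
print(pass_at_k(256, 3, 1))    # ~0.012 -- looks weak at k=1
print(pass_at_k(256, 3, 128))  # ~0.88  -- the capability was there all along
```

Evaluating at large k is what reveals latent ability: a base model that rarely samples a correct path still registers high pass@k once enough attempts are allowed.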
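To see why sharpening the output distribution can raise pass@1 yet lower pass@k at large k, consider a toy benchmark where each problem has a per-sample probability of yielding a correct answer. All numbers below are invented purely for illustration and are not drawn from the paper's data.

```python
# Toy illustration of distribution sharpening (all numbers invented).
# Each entry is one problem's per-sample probability of a correct answer.
base_p = [0.05, 0.02, 0.08]   # base model: small but nonzero mass on correct paths
rl_p   = [0.95, 0.00, 0.90]   # RL-tuned: sharpened where rewarded, zero on problem 2

def pass_at_k(p: float, k: int) -> float:
    # With independent samples, pass@k = 1 - P(all k samples are wrong).
    return 1.0 - (1.0 - p) ** k

for k in (1, 16, 256):
    base = sum(pass_at_k(p, k) for p in base_p) / len(base_p)
    rl = sum(pass_at_k(p, k) for p in rl_p) / len(rl_p)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

The RL-tuned model dominates at k=1 but plateaus at 2/3: it has lost all probability mass on the second problem's correct paths, while the base model's broader distribution eventually solves every problem as k grows. This mirrors the crossover the study reports between small-k and large-k evaluation.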
