
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Bensal, Shelly; Jamil, Umar; Bryant, Christopher; Russak, Melisa; Kamble, Kiran; Mozolevskyi, Dmytro; Ali, Muayad; AlShikh, Waseem
Published: 6/4/2025
Abstract

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
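
To make the two-stage loop concrete, here is a minimal Python sketch of the reflect-retry-reward procedure as described in the abstract. The model interface (model.generate), the task dictionary, the verify checker, and the prompt wording are hypothetical placeholders for illustration, not the authors' implementation; only the overall flow (attempt, reflect on failure, retry with the reflection in context, reward the reflection tokens on success) follows the paper's description.

def reflect_retry_reward(model, task, verify):
    # First attempt at the verifiable task.
    first_answer = model.generate(task["prompt"])
    if verify(task, first_answer):
        return []  # success on the first try: no reflection is generated, nothing to reward

    # Stage 1: on failure, the model writes a self-reflection analyzing its previous attempt.
    reflection = model.generate(
        task["prompt"]
        + "\nPrevious (incorrect) attempt:\n" + first_answer
        + "\nReflect on why this attempt failed."
    )

    # Stage 2: the model retries the task with the self-reflection in context.
    second_answer = model.generate(
        task["prompt"] + "\nSelf-reflection:\n" + reflection
    )

    # Only binary feedback is available: if the retry succeeds,
    # the tokens generated during the reflection phase receive the reward.
    reward = 1.0 if verify(task, second_answer) else 0.0
    return [(token, reward) for token in reflection.split()]

In a training setup, the (token, reward) pairs returned here would feed a reinforcement-learning update that credits only the self-reflection, which is what incentivizes the model to produce more useful reflections over time.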