Reinforcement Fine-Tuning
Reinforcement Fine-Tuning (RFT) is a method that combines supervised fine-tuning (SFT) and reinforcement learning (RL). It improves the model's ability to generate high-quality answers by having the model learn from multiple sampled reasoning paths and automatically evaluating how well each path's answer matches the ground truth.
RFT was first proposed by ByteDance in 2024; the paper "ReFT: Reasoning with REinforced Fine-Tuning" was published at ACL 2024. The technique improves model performance in two stages. The first is a warm-up stage, which uses SFT to give the model a foundation for generating basically correct responses to mathematical problems. The second is a reinforcement learning (RL) stage, which optimizes the model with online reinforcement learning (specifically the PPO algorithm): a large number of reasoning paths are automatically sampled, rewards are derived from the ground-truth answers, and the model is further fine-tuned on those rewards.
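The core of the RL stage is a verifiable reward: each sampled reasoning path is scored by comparing the final answer it produces against the ground truth. The sketch below is a minimal Python illustration of that idea; the `Answer:` extraction convention and the partial-credit value are assumptions made here for illustration, not the paper's exact implementation.

```python
import re

def extract_final_answer(reasoning_path: str) -> str | None:
    """Pull the final numeric answer out of a generated reasoning path.

    Assumes the model ends its chain of thought with a line such as
    'Answer: 42' -- a convention chosen here purely for illustration.
    """
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", reasoning_path)
    return match.group(1) if match else None

def answer_reward(reasoning_path: str, gold_answer: str) -> float:
    """Score one sampled reasoning path against the ground-truth answer.

    1.0 -> extracted answer matches the ground truth
    0.1 -> an answer was produced but is wrong (partial credit, assumed here)
    0.0 -> no parseable answer at all
    """
    predicted = extract_final_answer(reasoning_path)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold_answer else 0.1

# Toy usage: score a few sampled paths for one math question with gold answer 18.
gold = "18"
sampled_paths = [
    "There are 16 + 2 = 18 eggs in total. Answer: 18",
    "16 - 3 = 13 eggs remain. Answer: 13",
    "The flock lays many eggs every day.",
]
print([answer_reward(p, gold) for p in sampled_paths])  # [1.0, 0.1, 0.0]
```

Rewards like this would then be fed to the PPO update in the RL stage, so the model is pushed toward reasoning paths that actually reach the correct answer.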
RFT outperforms SFT on multiple datasets; on the CodeLLAMA model, RFT's accuracy on the GSM8K dataset is nearly 10 percentage points higher than SFT's. The technique lets the model learn not only the final answer but also how to optimize its reasoning path for the task at hand: it builds a "feedback loop" in which a domain-specific scorer grades the model's outputs, guiding training toward solutions suited to the needs of a specific scenario.
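The "feedback loop" described above can be pictured as a domain-specific scorer wired into the sampling loop: the model proposes candidate solutions, the scorer grades them, and the grades (rather than hand-written target outputs) drive the policy update. The sketch below only illustrates the shape of that loop; `sample_responses`, `fake_sampler`, and `length_scorer` are hypothetical stand-ins for a real policy and grader, and the PPO update itself is omitted.

```python
from typing import Callable, List, Tuple

# A scorer maps (prompt, response) to a score in [0, 1].
Scorer = Callable[[str, str], float]

def feedback_loop_step(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],  # hypothetical policy sampler
    scorer: Scorer,                                      # domain-specific grader
    num_samples: int = 4,
) -> List[Tuple[str, str, float]]:
    """One step of the feedback loop: sample candidates, score them, return rollouts.

    The scored (prompt, response, reward) triples would then feed a policy
    update (PPO in ReFT); that update is left out of this sketch.
    """
    rollouts = []
    for prompt in prompts:
        for response in sample_responses(prompt, num_samples):
            rollouts.append((prompt, response, scorer(prompt, response)))
    return rollouts

# Toy usage with stub components standing in for a real model and grader.
def fake_sampler(prompt: str, n: int) -> List[str]:
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def length_scorer(prompt: str, response: str) -> float:
    # Placeholder grader; a real one would check domain-specific correctness.
    return min(len(response) / 100.0, 1.0)

print(feedback_loop_step(["solve 2 + 2"], fake_sampler, length_scorer, num_samples=2))
```

Swapping in a grader that encodes the requirements of a particular scenario is what lets RFT adapt the model's reasoning to that scenario without collecting new labeled demonstrations.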