Direct Preference Optimization
Direct Preference Optimization (DPO) is a fine-tuning strategy for aligning large language models (LLMs) with human preferences. It was proposed in 2023 by a research team from Stanford University and CZ Biohub in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", published at NeurIPS 2023.
The core idea of DPO is to optimize directly on human preference data, without training a separate reward model or using reinforcement learning. The language model is fine-tuned on binary preference data (pairs of preferred and dispreferred responses) so that it becomes more likely to generate human-preferred answers. Compared with traditional reinforcement learning from human feedback (RLHF), DPO is simpler, more stable, and less computationally expensive. It avoids explicit reward-model fitting by folding the preference loss directly into the policy objective, and uses a KL-divergence constraint to keep the model being trained from drifting too far from the reference model.
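To make the objective concrete, here is a minimal PyTorch sketch of the DPO loss. It assumes sequence-level log-probabilities have already been computed for the chosen and rejected responses under both the policy being trained and a frozen reference model; the function name, tensor shapes, and the default `beta` value are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of the chosen / rejected response under either the
    policy being trained or the frozen reference model. `beta` controls
    the strength of the implicit KL constraint toward the reference model.
    """
    # Log-ratios of policy vs. reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO objective: -log sigmoid(beta * (chosen_logratio - rejected_logratio))
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Example usage with random stand-in log-probs for a batch of 4 pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

The loss pushes the policy to assign a higher probability (relative to the reference model) to the preferred response than to the rejected one, which is how the preference signal and the KL-style regularization are combined in a single supervised objective.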
DPO was proposed to address limitations of RLHF such as high computational cost, complex reward modeling, and training instability. Experiments show that DPO outperforms PPO-based RLHF at controlling the sentiment of generated text, and matches or improves response quality on summarization and single-turn dialogue. In addition, DPO can be extended by introducing an offset into the loss to handle preference pairs with different preference strengths, which further improves model performance (see the sketch below).
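The offset idea can be illustrated as a small modification of the loss above; the per-pair margin `delta` is a hypothetical input (for example, scaled from annotator preference strength), and this sketch shows the general margin-style variant rather than any specific published implementation.

```python
import torch.nn.functional as F

def dpo_loss_with_offset(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps,
                         delta, beta=0.1):
    """DPO loss with a per-pair offset `delta` (shape (batch,)).

    A larger `delta` demands a larger implicit reward gap between the
    chosen and rejected responses before the pair stops contributing to
    the loss, so strongly preferred pairs exert more training pressure.
    """
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits - delta).mean()
```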