MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvements on DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
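To make the MLE-versus-MaP distinction concrete, the sketch below contrasts the standard DPO loss with a hypothetical prior-augmented variant in which a prior reward gap replaces the hard 0/1 preference label with a soft target. This is an illustrative sketch only, not the MaPPO objective from the paper; the function names, the `beta=0.1` default, and the use of a sigmoid of the prior reward gap as the soft target are all assumptions made for exposition.

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO (MLE) loss on a batch of preference pairs.

    Inputs are summed token log-probabilities of each full response
    under the policy and the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # MLE view: every pair is a hard binary classification (chosen beats rejected).
    return -F.logsigmoid(beta * margin).mean()


def prior_augmented_loss(logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected,
                         prior_reward_chosen, prior_reward_rejected, beta=0.1):
    """Hypothetical MaP-style variant (illustration only, not MaPPO itself).

    A prior reward estimate for each response softens the binary target:
    pairs whose prior rewards are close contribute a less extreme label
    than pairs with a large prior reward gap.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Soft target in (0, 1) derived from the prior reward gap (an assumption).
    prior_target = torch.sigmoid(prior_reward_chosen - prior_reward_rejected)
    return F.binary_cross_entropy_with_logits(beta * margin, prior_target)
```

The intended takeaway is only the structural difference: the MLE objective scores every chosen/rejected pair identically, whereas a MaP-style objective lets prior reward knowledge modulate how strongly each pair is enforced.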