

Towards a Unified View of Large Language Model Post-Training

Abstract

Two major sources of training data exist for post-training modern language models: online data (model-generated rollouts) and offline data (human or other-model demonstrations). These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
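To make the four-part decomposition concrete, the sketch below assumes a per-token gradient term of the form stabilization mask × advantage × ∇π_θ / π_ref, which is one way to read the abstract's description; the function names, tensor shapes, success-rate trigger, and 0.5 threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def unified_pg_loss(logp_theta, logp_ref, advantage, stable_mask):
    """Single-term surrogate loss for the assumed unified gradient form.

    Its gradient with respect to the policy parameters is
        stable_mask * advantage * grad(pi_theta) / pi_ref,
    combining the four parts named in the abstract: stabilization mask,
    reference-policy denominator, advantage estimate, likelihood gradient.
    All tensors are assumed to have shape (batch, seq_len).
    """
    # pi_theta / pi_ref coefficient, detached so that only grad log pi_theta
    # (the likelihood-gradient part) flows back through logp_theta.
    coef = torch.exp(logp_theta - logp_ref).detach()
    per_token = stable_mask * advantage * coef * logp_theta
    return -per_token.sum(dim=-1).mean()


def hpt_loss(policy_logp, ref_logp, demo_logp, advantage, stable_mask,
             rollout_success_rate, threshold=0.5):
    """Toy Hybrid Post-Training switch: fall back to SFT on demonstrations
    when the model's own rollouts are weak, otherwise use the RL-style
    unified term. The trigger and threshold are assumptions for illustration.
    """
    if rollout_success_rate < threshold:
        return -demo_logp.sum(dim=-1).mean()  # SFT on offline demonstrations
    return unified_pg_loss(policy_logp, ref_logp, advantage, stable_mask)
```

For instance, setting the mask and advantage to 1 and taking the reference denominator to be the current policy (held fixed) reduces the unified term to the SFT log-likelihood gradient, which illustrates, under these assumptions, the abstract's claim that SFT and RL optimize a common objective.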
