

Towards a Unified View of Large Language Model Post-Training

Abstract

Two major sources of training data exist for post-training modern language models: online data (model-generated rollouts) and offline data (human or other-model demonstrations). These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
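To make the four-part decomposition concrete, the sketch below assumes a per-token gradient term of the form stabilization mask × advantage × ∇π_θ / π_ref, which is one way to read the abstract's description; the function names, tensor shapes, success-rate trigger, and 0.5 threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def unified_pg_loss(logp_theta, logp_ref, advantage, stable_mask):
    """Single-term surrogate loss for the assumed unified gradient form.

    Its gradient with respect to the policy parameters is
        stable_mask * advantage * grad(pi_theta) / pi_ref,
    combining the four parts named in the abstract: stabilization mask,
    reference-policy denominator, advantage estimate, likelihood gradient.
    All tensors are assumed to have shape (batch, seq_len).
    """
    # pi_theta / pi_ref coefficient, detached so that only grad log pi_theta
    # (the likelihood-gradient part) flows back through logp_theta.
    coef = torch.exp(logp_theta - logp_ref).detach()
    per_token = stable_mask * advantage * coef * logp_theta
    return -per_token.sum(dim=-1).mean()


def hpt_loss(policy_logp, ref_logp, demo_logp, advantage, stable_mask,
             rollout_success_rate, threshold=0.5):
    """Toy Hybrid Post-Training switch: fall back to SFT on demonstrations
    when the model's own rollouts are weak, otherwise use the RL-style
    unified term. The trigger and threshold are assumptions for illustration.
    """
    if rollout_success_rate < threshold:
        return -demo_logp.sum(dim=-1).mean()  # SFT on offline demonstrations
    return unified_pg_loss(policy_logp, ref_logp, advantage, stable_mask)
```

For instance, setting the mask and advantage to 1 and taking the reference denominator to be the current policy (held fixed) reduces the unified term to the SFT log-likelihood gradient, which illustrates, under these assumptions, the abstract's claim that SFT and RL optimize a common objective.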
