
Group Variance Policy Optimization (GVPO)


Organization

The Hong Kong University of Science and Technology (Guangzhou)

Paper URL

https://arxiv.org/abs/2504.19599

Group Variance Policy Optimization (GVPO) was proposed in April 2025 by the Zuoyebang team in collaboration with the Hong Kong University of Science and Technology (Guangzhou). The work was published in the paper "GVPO: Group Variance Policy Optimization for Large Language Model Post-Training", which was accepted at NeurIPS 2025.

GVPO incorporates the analytical solution of the KL-constrained reward-maximization problem directly into its gradient weights, ensuring consistency with the optimal policy. The method admits an intuitive physical interpretation: its gradient corresponds to the mean squared error between the central distance of the implicit reward (each sample's deviation from the group-mean implicit reward) and the central distance of the actual reward. GVPO has two key advantages: first, it guarantees a unique optimal solution, namely exactly the solution of the KL-constrained reward-maximization objective; second, it supports flexible sampling distributions, avoiding the restrictions of on-policy sampling and importance sampling. A minimal sketch of such a group-variance loss is given below.
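The following PyTorch sketch illustrates this interpretation under stated assumptions: the implicit reward is taken as β·log(π_θ/π_ref), following the closed-form solution of KL-constrained reward maximization, and the loss is the mean squared error between the centered implicit and centered actual rewards within one group. The function name gvpo_group_loss, the β value, and the overall loss scaling are illustrative choices, not taken verbatim from the paper.

```python
import torch

def gvpo_group_loss(logp_theta, logp_ref, rewards, beta=0.1):
    # logp_theta: (k,) summed log-probabilities of k sampled responses
    #             under the current policy (requires grad).
    # logp_ref:   (k,) summed log-probabilities under the frozen reference policy.
    # rewards:    (k,) scalar rewards for the same k responses.
    # beta:       strength of the KL constraint (assumed hyperparameter).

    # Implicit reward implied by the policy ratio; this form follows from
    # the analytical solution of KL-constrained reward maximization.
    r_implicit = beta * (logp_theta - logp_ref)

    # Central distances: each sample's deviation from its group mean.
    implicit_centered = r_implicit - r_implicit.mean()
    actual_centered = rewards - rewards.mean()

    # Mean squared error between the two central distances.
    return ((implicit_centered - actual_centered) ** 2).mean()

# Toy usage with k = 4 sampled responses for one prompt.
logp_theta = torch.tensor([-12.3, -8.7, -10.1, -9.4], requires_grad=True)
logp_ref = torch.tensor([-11.9, -9.0, -10.5, -9.2])
rewards = torch.tensor([0.8, 0.2, 0.5, 0.6])
loss = gvpo_group_loss(logp_theta, logp_ref, rewards)
loss.backward()  # gradients flow only through logp_theta
print(loss.item())
```

Because the loss depends only on within-group deviations, any group-wide reward offset cancels out, which is what permits sampling the k responses from a distribution other than the current policy.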

