
Group Variance Policy Optimization (GVPO)


Organization

The Hong Kong University of Science and Technology (Guangzhou)

Paper URL

https://arxiv.org/abs/2504.19599

Group Variance Policy Optimization (GVPO) was proposed in April 2025 by the Zuoyebang team in collaboration with the Hong Kong University of Science and Technology (Guangzhou). The work was published in the paper "GVPO: Group Variance Policy Optimization for Large Language Model Post-Training", which was accepted at NeurIPS 2025.

GVPO incorporates the analytical solution of the KL-constrained reward-maximization problem directly into its gradient weights, ensuring consistency with the optimal policy. The method admits an intuitive physical interpretation: its gradient corresponds to the mean squared error between the central distance of the implicit reward (each sample's deviation from the group-mean implicit reward) and the central distance of the actual reward. GVPO has two key advantages: first, it guarantees a unique optimal solution, namely exactly the solution of the KL-constrained reward-maximization objective; second, it supports flexible sampling distributions, avoiding the restrictions of on-policy sampling and importance sampling. A minimal sketch of such a group-variance loss is given below.
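The following PyTorch sketch illustrates this interpretation under stated assumptions: the implicit reward is taken as β·log(π_θ/π_ref), following the closed-form solution of KL-constrained reward maximization, and the loss is the mean squared error between the centered implicit and centered actual rewards within one group. The function name gvpo_group_loss, the β value, and the overall loss scaling are illustrative choices, not taken verbatim from the paper.

```python
import torch

def gvpo_group_loss(logp_theta, logp_ref, rewards, beta=0.1):
    # logp_theta: (k,) summed log-probabilities of k sampled responses
    #             under the current policy (requires grad).
    # logp_ref:   (k,) summed log-probabilities under the frozen reference policy.
    # rewards:    (k,) scalar rewards for the same k responses.
    # beta:       strength of the KL constraint (assumed hyperparameter).

    # Implicit reward implied by the policy ratio; this form follows from
    # the analytical solution of KL-constrained reward maximization.
    r_implicit = beta * (logp_theta - logp_ref)

    # Central distances: each sample's deviation from its group mean.
    implicit_centered = r_implicit - r_implicit.mean()
    actual_centered = rewards - rewards.mean()

    # Mean squared error between the two central distances.
    return ((implicit_centered - actual_centered) ** 2).mean()

# Toy usage with k = 4 sampled responses for one prompt.
logp_theta = torch.tensor([-12.3, -8.7, -10.1, -9.4], requires_grad=True)
logp_ref = torch.tensor([-11.9, -9.0, -10.5, -9.2])
rewards = torch.tensor([0.8, 0.2, 0.5, 0.6])
loss = gvpo_group_loss(logp_theta, logp_ref, rewards)
loss.backward()  # gradients flow only through logp_theta
print(loss.item())
```

Because the loss depends only on within-group deviations, any group-wide reward offset cancels out, which is what permits sampling the k responses from a distribution other than the current policy.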

