
Abstract
Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking trajectory importance. However, existing explorations encounter both advantage-reversion and advantage-mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that trajectories exhibit different levels of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
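To make the mechanism concrete, the sketch below shows the standard group-relative advantage used in GRPO (the z-score of each trajectory's reward within its query group), alongside a hypothetical "percent deviation" variant and a certainty-weighted blend. The percent-deviation formula, the certainty measure, and the mixing weight are illustrative assumptions; the abstract does not specify MAPO's exact definitions.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard GRPO group-relative advantage: z-score of each trajectory's
    reward within its query group (this part follows the usual GRPO recipe)."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < 1e-8:  # all trajectories rewarded equally -> no ranking signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def percent_deviation_advantage(rewards):
    """Hypothetical 'advantage percent deviation': deviation from the group
    mean normalized by the mean instead of the standard deviation.
    The exact form used in MAPO is not given in the abstract; this is only
    an illustrative guess."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean()
    if abs(mean) < 1e-8:
        return np.zeros_like(rewards)
    return (rewards - mean) / abs(mean)

def mixed_advantage(rewards, certainty):
    """Illustrative mixing rule: blend the two advantage estimates according to
    a trajectory-certainty score in [0, 1] (e.g., the fraction of trajectories
    in the group sharing the majority outcome). The weighting scheme is an
    assumption, not the paper's definition."""
    w = float(np.clip(certainty, 0.0, 1.0))
    return w * percent_deviation_advantage(rewards) + (1.0 - w) * grpo_advantage(rewards)

# Example: a group of 4 sampled trajectories for one query, binary rewards.
rewards = [1.0, 1.0, 1.0, 0.0]
print(mixed_advantage(rewards, certainty=0.75))
```

The intent of the blend is only to illustrate the abstract's idea of adapting the advantage computation per sample: groups whose trajectories mostly agree (high certainty) lean on the percent-deviation term, while mixed-outcome groups fall back to the standard group-relative z-score.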