
Abstract
Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking trajectory importance. However, existing explorations encounter both advantage-reversion and advantage-mirror problems, which hinder reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that trajectories exhibit different levels of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
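To make the mechanism concrete, the sketch below shows the standard group-relative advantage used in GRPO (the z-score of each trajectory's reward within its query group), alongside a hypothetical "percent deviation" variant and a certainty-weighted blend. The percent-deviation formula, the certainty measure, and the mixing weight are illustrative assumptions; the abstract does not specify MAPO's exact definitions.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard GRPO group-relative advantage: z-score of each trajectory's
    reward within its query group (this part follows the usual GRPO recipe)."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < 1e-8:  # all trajectories rewarded equally -> no ranking signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def percent_deviation_advantage(rewards):
    """Hypothetical 'advantage percent deviation': deviation from the group
    mean normalized by the mean instead of the standard deviation.
    The exact form used in MAPO is not given in the abstract; this is only
    an illustrative guess."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean()
    if abs(mean) < 1e-8:
        return np.zeros_like(rewards)
    return (rewards - mean) / abs(mean)

def mixed_advantage(rewards, certainty):
    """Illustrative mixing rule: blend the two advantage estimates according to
    a trajectory-certainty score in [0, 1] (e.g., the fraction of trajectories
    in the group sharing the majority outcome). The weighting scheme is an
    assumption, not the paper's definition."""
    w = float(np.clip(certainty, 0.0, 1.0))
    return w * percent_deviation_advantage(rewards) + (1.0 - w) * grpo_advantage(rewards)

# Example: a group of 4 sampled trajectories for one query, binary rewards.
rewards = [1.0, 1.0, 1.0, 0.0]
print(mixed_advantage(rewards, certainty=0.75))
```

The intent of the blend is only to illustrate the abstract's idea of adapting the advantage computation per sample: groups whose trajectories mostly agree (high certainty) lean on the percent-deviation term, while mixed-outcome groups fall back to the standard group-relative z-score.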