SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization

Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions): some lead to consistent, semantically similar answers, while others yield diverse or contradictory outputs. This variation reflects the LLM's uncertainty about the input prompt, a signal of how confidently the model understands a given problem. However, vanilla Group Relative Policy Optimization (GRPO) treats all prompts equally during policy updates, ignoring this important information about the model's knowledge boundaries. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs' uncertainty of the input prompts via semantic entropy. Semantic entropy measures the diversity of meaning across multiple generated answers to a prompt and is used to modulate the magnitude of policy updates. This uncertainty-aware training mechanism dynamically adjusts policy update magnitudes based on question uncertainty, allowing more conservative updates on high-uncertainty questions while maintaining the original learning signal on confident ones. Experimental results on five mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva 34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new state-of-the-art performance in average accuracy, validating the effectiveness of uncertainty-aware policy optimization.
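To make the mechanism concrete, below is a minimal Python sketch of how semantic entropy over a group of sampled answers could down-weight group-relative advantages. This is an illustrative assumption, not the paper's exact formulation: the function names (`semantic_entropy`, `seed_grpo_advantages`), the use of canonicalized final answers as semantic clusters, and the linear scaling rule `1 - H / H_max` are all placeholders for whatever clustering and modulation SEED-GRPO actually uses.

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Entropy over semantic clusters of sampled answers.

    Assumption: answers are already mapped to canonical meanings
    (e.g., final numeric answers), so identical strings form one cluster.
    """
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def seed_grpo_advantages(rewards, answers, max_entropy=None):
    """Group-relative advantages, damped for high-uncertainty prompts.

    The scaling rule (1 - H / H_max) is an illustrative choice showing
    'more conservative updates on high-uncertainty questions'; it is not
    necessarily the modulation used in the paper.
    """
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    base_adv = [(r - mean_r) / std_r for r in rewards]  # vanilla GRPO advantage

    h = semantic_entropy(answers)
    h_max = max_entropy if max_entropy is not None else math.log(n)
    scale = max(0.0, 1.0 - h / h_max)  # shrinks toward 0 as answers diverge
    return [a * scale for a in base_adv]

# Example: 8 samples for one prompt; diverse answers -> damped advantages
rewards = [1, 0, 1, 0, 0, 1, 0, 0]
answers = ["42", "41", "42", "7", "13", "42", "9", "8"]
print(seed_grpo_advantages(rewards, answers))
```

Under this sketch, a prompt whose samples all agree (entropy near zero) keeps its full GRPO learning signal, while a prompt producing many contradictory answers receives a strongly attenuated update.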