Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

Abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
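As a rough sketch of the mechanism the abstract describes (not the paper's reference implementation; the length normalization, function names, and clipping threshold below are assumptions), a sequence-level importance ratio with PPO-style clipping might look like:

```python
import numpy as np

def sequence_importance_ratio(logp_new, logp_old):
    """Sequence-level importance ratio from per-token log-probs.

    Computes the ratio of sequence likelihoods under the new and old
    policies; the per-token averaging (a length normalization assumed
    here) keeps the ratio comparable across response lengths.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    # exp of the mean per-token log-ratio = geometric mean of token ratios.
    return float(np.exp((logp_new - logp_old).mean()))

def clipped_sequence_objective(ratios, advantages, eps=0.2):
    """PPO-style clipped surrogate applied at the sequence level.

    Each element of `ratios`/`advantages` corresponds to one whole
    response, so clipping excludes entire sequences rather than
    individual tokens. `eps` is an illustrative clipping range.
    """
    r = np.asarray(ratios, dtype=float)
    a = np.asarray(advantages, dtype=float)
    return float(np.minimum(r * a, np.clip(r, 1.0 - eps, 1.0 + eps) * a).mean())
```

This contrasts with token-level schemes such as GRPO, where each token carries its own importance ratio and is clipped independently; here the unit of clipping, rewarding, and optimization is the full sequence.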