Length-Unbiased Sequence Policy Optimization: Unveiling and Controlling Response-Length Variation in RLVR

Fanfan Liu Youyang Yin Peng Shi Siqi Yang Zhixiong Zeng Haibo Qiu

Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has been applied to Large Language Models (LLMs) and Vision-Language Models (VLMs), achieving remarkable results in improving reasoning ability on complex tasks. During RLVR training, growth in response length is widely regarded as a key driver of improved reasoning. However, different RLVR algorithms show markedly different patterns in how response length evolves during training. To uncover the underlying causes of these differences, this work analyzes the components of mainstream RLVR algorithms in detail, develops a theoretical account of the factors that influence response length, and validates that theory through extensive experiments. Building on these theoretical insights, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, LUSPO corrects the response-length bias inherent in Group Sequence Policy Optimization (GSPO), making the loss function unbiased with respect to response length and thereby resolving the response-length collapse problem. Extensive experiments on mathematical reasoning benchmarks and multimodal reasoning scenarios show that LUSPO consistently delivers superior performance. Empirical results demonstrate that, compared with existing methods such as GRPO and GSPO, LUSPO constitutes a new state-of-the-art optimization strategy.

One-sentence Summary

Researchers from Meituan propose LUSPO, a length-unbiased RLVR algorithm that corrects GSPO’s response length bias, enabling stable reasoning growth in LLMs and VLMs across math and multimodal tasks, outperforming GRPO and GSPO without length collapse.

Key Contributions

  • We identify and theoretically explain how GRPO and GSPO introduce length bias during RLVR training, causing models to favor shorter responses under GSPO and undermining reasoning performance.
  • We propose LUSPO, a length-unbiased policy optimization method that scales sequence loss by response length to eliminate this bias, enabling stable training and accelerated growth in reasoning depth.
  • Empirical results across dense and MoE models on benchmarks like AIME24, MathVista, and MathVision show LUSPO outperforms GRPO and GSPO, achieving up to 6.9% higher accuracy on reasoning tasks.

Introduction

The authors leverage Reinforcement Learning with Verifiable Rewards (RLVR) to improve reasoning in large language and vision-language models, where response length often correlates with reasoning depth. Prior methods like GRPO and GSPO suffer from implicit length bias: GRPO penalizes longer correct responses, while GSPO’s sequence-level clipping exacerbates bias by disproportionately suppressing negative samples, leading to response collapse during training. To fix this, they propose Length-Unbiased Sequence Policy Optimization (LUSPO), which scales each sequence’s loss by its length to neutralize bias. LUSPO stabilizes training across dense and MoE models, accelerates response length growth, and improves accuracy on math and multimodal benchmarks without requiring architectural changes.

Dataset

  • The authors use two primary datasets: DAPO-MATH-17K for training the main model and ViRL39K for training the vision-language (VL) model.
  • Both datasets are sourced from recent academic work (Yu et al., 2025 and Wang et al., 2025) and focus on scientific problem-solving, with an emphasis on math and logic.
  • The datasets were chosen for their rigor and their extensibility to other domains.
  • No further details on subset sizes, filtering rules, or processing steps are provided in this section.

Method

The authors leverage a policy optimization framework for autoregressive language models, treating the model as a policy $\pi_{\theta}$ that generates responses $y$ conditioned on queries $x$. Each response is evaluated by a verifier that assigns a scalar reward $r(x, y)$, forming the basis for reinforcement learning updates. The core innovation lies in the design of group-based policy optimization objectives that compare multiple responses per query to compute relative advantages, thereby stabilizing training and reducing variance.

The foundational method, Group Relative Policy Optimization (GRPO), samples a group of $G$ responses per query from the old policy $\pi_{\theta_{\text{old}}}$, computes their rewards, and constructs a token-level objective that incorporates clipped importance sampling weights. The importance ratio $w_{i,t}(\theta)$ for token $y_{i,t}$ is defined as the ratio of current to old policy probabilities, while the advantage $\widehat{A}_{i,t}$ is shared across all tokens in response $y_i$ and normalized relative to the group's mean and standard deviation of rewards. This design encourages the model to favor responses that outperform the group average.
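
As a concrete reference, here is a minimal sketch of a GRPO-style loss in PyTorch. The tensor shapes, the padding mask, the clipping threshold `clip_eps=0.2`, and the small epsilon in the normalization are illustrative assumptions rather than values taken from the paper.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """Sketch of a GRPO-style objective for one group of G responses.

    logp_new, logp_old: (G, T) per-token log-probs under the current / old policy
    rewards:            (G,)   scalar verifier reward per response
    mask:               (G, T) 1 for real response tokens, 0 for padding
    """
    # Group-relative advantage: normalize rewards by the group mean and std,
    # then share the same advantage across every token of a response.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1).expand_as(logp_new)                  # (G, T)

    # Token-level importance ratio w_{i,t}(theta) with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = torch.minimum(ratio * adv, clipped * adv)

    # Average over the valid tokens of each response, then over the group.
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_seq.mean()
```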

Building on this, Group Sequence Policy Optimization (GSPO) replaces token-level importance weights with a sequence-level counterpart $s_i(\theta)$, defined as the geometric mean of token-level ratios over the entire response. This aligns more naturally with sequence-level rewards and provides a theoretically grounded basis for clipping. The GSPO objective retains the group-based advantage estimation but applies the sequence-level importance ratio directly to the advantage, simplifying the gradient computation and improving reward signal coherence.
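
A minimal sketch of the corresponding GSPO-style objective follows, assuming the same illustrative shapes as above. The key difference is that the importance ratio $s_i(\theta)$ is the geometric mean of the token ratios, i.e. the exponential of the length-averaged log-ratio, and clipping is applied at the sequence level.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """Sketch of a GSPO-style objective with a sequence-level importance ratio."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)

    # s_i(theta): geometric mean of the token-level ratios of each response,
    # computed as exp of the per-token average log-ratio.
    lengths = mask.sum(dim=1)                                   # |y_i|
    avg_log_ratio = ((logp_new - logp_old) * mask).sum(dim=1) / lengths
    seq_ratio = torch.exp(avg_log_ratio)

    # Sequence-level clipping applied directly to the shared advantage.
    clipped = torch.clamp(seq_ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_seq = torch.minimum(seq_ratio * adv, clipped * adv)
    return -per_seq.mean()
```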

However, both GRPO and GSPO exhibit a response-length bias: shorter responses receive higher per-token weight in the loss, leading the model to favor brevity, especially under GSPO's sequence-level clipping regime. To address this, the authors introduce Length-Unbiased Sequence Policy Optimization (LUSPO), which scales each sequence's contribution to the loss by its own length $|y_i|$. This simple modification ensures that longer responses are not penalized for their length, thereby eliminating the gradient bias inherent in GSPO.
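
A sketch of the LUSPO modification is shown below: it reuses the sequence-level ratio from the GSPO sketch but weights each sequence's term by its own length $|y_i|$. Normalizing the batch average by the total number of response tokens in the group is an implementation assumption, not a claim about the paper's exact formula.

```python
import torch

def luspo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """Sketch of a length-unbiased (LUSPO-style) sequence objective."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)

    # Sequence-level importance ratio, identical to the GSPO sketch above.
    lengths = mask.sum(dim=1)                                   # |y_i|
    avg_log_ratio = ((logp_new - logp_old) * mask).sum(dim=1) / lengths
    seq_ratio = torch.exp(avg_log_ratio)

    clipped = torch.clamp(seq_ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_seq = torch.minimum(seq_ratio * adv, clipped * adv)

    # Length-unbiased weighting: scale each sequence term by |y_i| so that
    # longer responses are not implicitly down-weighted (assumed normalization
    # by the total token count of the group).
    return -(per_seq * lengths).sum() / lengths.sum()
```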

The gradient analysis confirms that LUSPO's objective yields a gradient expression where the length normalization factor cancels out, leaving a clean sum over token-level policy gradients weighted by the sequence-level advantage and importance ratio. In contrast, GSPO's gradient retains a $1/|y_i|$ factor, which introduces length-dependent scaling. This theoretical insight validates LUSPO's design as a principled correction to the length bias.
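
The cancellation can be made explicit with a short derivation, under the simplifying assumption that the clip is inactive so only the unclipped term contributes, writing $\widehat{A}_i$ for the shared group-relative advantage of response $y_i$:

```latex
% Sketch of the length-bias cancellation (clipping assumed inactive).
\begin{align*}
s_i(\theta)
  &= \exp\!\Big(\tfrac{1}{|y_i|}\textstyle\sum_{t=1}^{|y_i|}
       \log\tfrac{\pi_\theta(y_{i,t}\mid x,\,y_{i,<t})}
                 {\pi_{\theta_{\text{old}}}(y_{i,t}\mid x,\,y_{i,<t})}\Big),
\qquad
\nabla_\theta s_i(\theta)
   = \frac{s_i(\theta)}{|y_i|}\sum_{t=1}^{|y_i|}
       \nabla_\theta \log \pi_\theta(y_{i,t}\mid x,\,y_{i,<t}), \\[4pt]
\text{GSPO:}\quad
\nabla_\theta\big[\, s_i(\theta)\,\widehat{A}_i \,\big]
  &= \frac{\widehat{A}_i\, s_i(\theta)}{|y_i|}
     \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t}\mid x,\,y_{i,<t}), \\[4pt]
\text{LUSPO:}\quad
\nabla_\theta\big[\, |y_i|\, s_i(\theta)\,\widehat{A}_i \,\big]
  &= \widehat{A}_i\, s_i(\theta)
     \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t}\mid x,\,y_{i,<t}).
\end{align*}
```

The $1/|y_i|$ factor survives in the GSPO gradient but cancels against the explicit $|y_i|$ weight in LUSPO, leaving a length-independent sum of token-level policy gradients.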

During training, the reward function combines three components: accuracy ($\mathcal{R}_{\text{accuracy}} \in \{0, 1\}$), format adherence ($\mathcal{R}_{\text{format}} \in \{0, 0.5\}$), and a penalty for overlong responses, $\mathcal{R}_{\text{overlong}}(y)$, which linearly penalizes responses exceeding a buffer length $L_{\text{buffer}}$ relative to the maximum allowed length $L_{\max}$. This composite reward encourages both correctness and conciseness while maintaining structural compliance with prompt requirements.
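
A minimal sketch of how such a composite reward might be computed is shown below. The 0/1 accuracy and 0/0.5 format values follow the text, while the exact shape of the overlong penalty (a linear ramp starting at $L_{\max} - L_{\text{buffer}}$ and capped at $-1$) is an assumption.

```python
def composite_reward(is_correct: bool, follows_format: bool,
                     length: int, l_max: int, l_buffer: int) -> float:
    """Sketch of the accuracy + format + overlong-penalty reward."""
    r_accuracy = 1.0 if is_correct else 0.0          # R_accuracy in {0, 1}
    r_format = 0.5 if follows_format else 0.0        # R_format in {0, 0.5}

    # Linear penalty once the response enters the buffer zone before L_max
    # (assumed form; capped at -1 for responses at or beyond L_max).
    soft_limit = l_max - l_buffer
    if length <= soft_limit:
        r_overlong = 0.0
    else:
        r_overlong = -min(1.0, (length - soft_limit) / l_buffer)

    return r_accuracy + r_format + r_overlong
```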

Experiment

  • LUSPO consistently outperforms GSPO and GRPO across dense, MoE, and vision-language models, showing strong generalization in both text-only and multimodal benchmarks.
  • LUSPO mitigates GSPO’s length bias, leading to significantly longer and more stable response lengths during training, which enhances exploration and complex reasoning.
  • Models trained with LUSPO achieve higher accuracy rewards and better validation scores, indicating improved learning and generalization rather than overfitting.
  • Ablation studies confirm LUSPO’s robustness across diverse datasets, maintaining superior performance even when length collapse is not inherent to the training data.

The authors evaluate LUSPO against GRPO and GSPO on multimodal benchmarks using Qwen2.5-VL-7B-Instruct, showing consistent performance gains across all tasks. LUSPO outperforms GSPO by up to 6.0% on LogicVista and 5.1% on WeMath, with an overall average improvement of 2.3 points, while also maintaining longer response lengths during training. These results indicate LUSPO's superior generalization in vision-language settings, achieved by mitigating the length bias inherent in GSPO.

The authors use LUSPO to train both dense and MoE models, observing that LUSPO consistently generates significantly longer responses than GSPO across model types. This increased response length correlates with better performance on validation benchmarks, suggesting LUSPO mitigates the length bias inherent in GSPO. Results show LUSPO enhances model capability by enabling more extensive exploration and complex reasoning during training.

The authors use LUSPO to train both dense and MoE models, achieving consistent performance gains over GSPO across multiple text-only benchmarks. Results show that LUSPO not only improves average scores but also delivers larger absolute improvements on challenging tasks like AIME25 and MATH500, indicating stronger generalization. The gains are especially pronounced in the MoE model, where LUSPO significantly boosts performance on AIME benchmarks compared to GSPO.

