GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-Reward RL Optimization
Abstract
As language models become more capable, users increasingly expect them not only to deliver accurate answers but also to exhibit behaviors that match diverse human preferences across different scenarios. To achieve this, reinforcement learning (RL) research increasingly integrates multiple rewards into the training pipeline, with each reward intended to capture a specific preference and steer the model toward desired behaviors. Recently, however, Group Relative Policy Optimization (GRPO) has come into widespread use under multi-reward settings without a systematic study of its suitability. In this work, we show that directly applying GRPO to normalize different combinations of rollout rewards causes them to collapse into identical advantage values. This reduces the resolution of the training signal and leads to suboptimal convergence, and in some cases even to early training failure. To address these issues, we introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that decouples the normalization of individual rewards. This preserves the relative differences between rewards more faithfully and enables more precise optimization under multiple rewards, while also achieving substantially improved training stability. We compare GDPO with GRPO on three tasks: tool calling, mathematical reasoning, and coding reasoning, evaluating both correctness-related metrics (accuracy, error rate) and constraint-adherence metrics (format, length). Across all scenarios, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward optimization in reinforcement learning.
One-sentence Summary
The authors, affiliated with institutions including the University of Washington, University of Illinois Urbana-Champaign, and DeepSeek AI, propose GDPO, a novel policy optimization method for multi-reward reinforcement learning that decouples reward normalization per objective to preserve fine-grained advantage distinctions, overcoming the signal collapse issue in GRPO; this enables more accurate, stable, and generalizable training across tool calling, math reasoning, and coding tasks, significantly improving alignment with diverse human preferences.
Key Contributions
- Multi-reward reinforcement learning (RL) for language models is increasingly important for aligning with diverse human preferences, but directly applying Group Relative Policy Optimization (GRPO) to multiple heterogeneous rewards causes reward combinations to collapse into identical advantage values, degrading training signal resolution and leading to unstable or failed training.
- The proposed Group reward-Decoupled Normalization Policy Optimization (GDPO) addresses this by decoupling group-wise normalization per individual reward, preserving relative differences across reward dimensions, followed by batch-wise advantage normalization to maintain stable update magnitudes regardless of reward count.
- GDPO consistently outperforms GRPO across three tasks—tool calling, math reasoning, and coding reasoning—showing higher accuracy (up to 6.3% on AIME), better format compliance, and improved training stability, demonstrating its effectiveness and generalizability in multi-reward RL.
Introduction
As language models grow more capable, aligning them with diverse human preferences—such as accuracy, safety, coherence, and format adherence—has become critical. This is increasingly achieved through multi-reward reinforcement learning (RL), where multiple reward signals guide policy optimization. However, prior work often applies Group Relative Policy Optimization (GRPO) directly to summed rewards, which collapses distinct reward combinations into identical advantage values due to group-wise normalization. This signal compression reduces training precision, degrades convergence, and can cause early training failure, especially in complex tasks involving multiple objectives.
To address this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards before applying batch-wise advantage normalization. This preserves fine-grained differences across reward combinations, leading to more accurate and stable policy updates. GDPO maintains numerical stability regardless of the number of rewards and avoids the signal collapse inherent in GRPO.
Evaluated on tool calling, math reasoning, and code generation tasks, GDPO consistently outperforms GRPO in both correctness and constraint adherence metrics. For example, it achieves up to 6.3% higher accuracy on AIME math benchmarks while producing shorter, more concise responses. The results demonstrate GDPO’s effectiveness and generalizability as a robust alternative to GRPO in multi-reward RL.
Method
The authors propose Group reward-Decoupled normalization Policy Optimization (GDPO), a method designed to address the issue of reward collapse in multi-reward reinforcement learning when using Group Relative Policy Optimization (GRPO). The core idea of GDPO is to decouple the normalization process by applying group-wise normalization independently to each individual reward before aggregating them into a final advantage signal. This contrasts with GRPO, which first sums all rewards and then applies group-wise normalization to the aggregated sum.
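To make the contrast concrete, the two normalization schemes can be written as follows. This is a reconstruction from the description in this section, where $r_k^{(i,j)}$ is the $k$-th reward of the $j$-th rollout for the $i$-th question and $G$ is the number of rollouts per question:

$$
\text{GRPO:}\quad r_{\text{sum}}^{(i,j)} = \sum_{k=1}^{n} r_k^{(i,j)}, \qquad A_{\text{sum}}^{(i,j)} = \frac{r_{\text{sum}}^{(i,j)} - \operatorname{mean}\big(\{r_{\text{sum}}^{(i,l)}\}_{l=1}^{G}\big)}{\operatorname{std}\big(\{r_{\text{sum}}^{(i,l)}\}_{l=1}^{G}\big)}
$$

$$
\text{GDPO:}\quad A_k^{(i,j)} = \frac{r_k^{(i,j)} - \operatorname{mean}\big(\{r_k^{(i,l)}\}_{l=1}^{G}\big)}{\operatorname{std}\big(\{r_k^{(i,l)}\}_{l=1}^{G}\big)}, \qquad A_{\text{sum}}^{(i,j)} = \sum_{k=1}^{n} A_k^{(i,j)}
$$

In GDPO, the summed advantage $A_{\text{sum}}^{(i,j)}$ is then normalized once more across the whole batch to obtain the final advantage $\hat{A}_{\text{sum}}^{(i,j)}$.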
As shown in the framework figure, GRPO computes the total reward $r_{\text{sum}}^{(i,j)}$ by summing the individual rewards $r_1^{(i,j)}, \ldots, r_n^{(i,j)}$ for a given rollout. The advantage $A_{\text{sum}}^{(i,j)}$ is then calculated by normalizing this summed reward across all rollouts for the same question. In contrast, GDPO first normalizes each reward $r_k^{(i,j)}$ separately, using the mean and standard deviation of its values across all $G$ rollouts for the $i$-th question, yielding normalized advantages $A_k^{(i,j)}$. The overall advantage is obtained by summing these normalized advantages, $A_{\text{sum}}^{(i,j)} = \sum_{k=1}^{n} A_k^{(i,j)}$, which is subsequently normalized across the entire batch to keep the values in a stable numerical range, giving the final advantage $\hat{A}_{\text{sum}}^{(i,j)}$.
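A minimal NumPy sketch of this two-stage computation follows. The array layout (questions × rollouts × rewards), the epsilon guard against zero standard deviation, and the use of the sample standard deviation (ddof=1, which matches the worked example values quoted below) are assumptions, not details taken from the paper:

```python
import numpy as np

def gdpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Decoupled advantage computation (illustrative sketch).

    rewards: shape (Q, G, n) -- Q questions, G rollouts per question,
             n reward dimensions per rollout.
    Returns: shape (Q, G) -- one final advantage per rollout.
    """
    # 1) Group-wise normalization, applied to each reward dimension separately:
    #    reward k is normalized over the G rollouts of the same question.
    mean = rewards.mean(axis=1, keepdims=True)           # (Q, 1, n)
    std = rewards.std(axis=1, ddof=1, keepdims=True)     # (Q, 1, n)
    per_reward_adv = (rewards - mean) / (std + eps)      # A_k^{(i,j)}

    # 2) Sum the normalized advantages across reward dimensions.
    adv_sum = per_reward_adv.sum(axis=-1)                # A_sum^{(i,j)}, shape (Q, G)

    # 3) Batch-wise normalization keeps the update magnitude stable
    #    regardless of how many rewards are combined.
    return (adv_sum - adv_sum.mean()) / (adv_sum.std(ddof=1) + eps)
```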
The figure's worked example also illustrates how this decoupled approach prevents the information loss that occurs in GRPO. Under GRPO, distinct reward combinations such as $(0, 2)$ and $(0, 1)$ can be normalized to the same advantage value, collapsing distinct reward signals into a single group: both $(0, 2)$ and $(0, 1)$ map to the same GRPO advantage $(-0.7071, 0.7071)$, leaving only two distinct groups of advantages. GDPO, by contrast, normalizes each reward dimension separately and preserves the differences between these combinations: the reward combination $(0, 1)$ becomes $(-0.7071, 0.7071)$ while $(0, 2)$ becomes $(-1.4142, 1.4142)$, yielding three distinct groups of advantages. GDPO therefore provides a more expressive and accurate training signal that better reflects the relative differences between reward combinations.
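A small numeric check of this example, using a group of $G = 2$ rollouts and the same normalization as above. The per-reward decomposition (two rewards of $[0, 1]$ each versus $[0, 1]$ plus a constant $[0, 0]$) is an assumption chosen to reproduce the quoted values, and the final batch-wise normalization is omitted because it does not change which cases stay distinct:

```python
import numpy as np

def group_norm(x, eps=1e-6):
    # Group-wise normalization with sample standard deviation.
    return (x - x.mean()) / (x.std(ddof=1) + eps)

# Case A: rewards [0, 1] and [0, 1] over two rollouts -> summed rewards [0, 2].
# Case B: rewards [0, 1] and [0, 0] over two rollouts -> summed rewards [0, 1].
case_a = np.array([[0.0, 0.0], [1.0, 1.0]])  # shape (G, n) = (2, 2)
case_b = np.array([[0.0, 0.0], [1.0, 0.0]])

for name, r in [("A (sum [0, 2])", case_a), ("B (sum [0, 1])", case_b)]:
    grpo = group_norm(r.sum(axis=1))  # GRPO: normalize the summed reward
    gdpo = sum(group_norm(r[:, k]) for k in range(r.shape[1]))  # GDPO: sum per-reward advantages
    print(name, "GRPO:", np.round(grpo, 4), "GDPO:", np.round(gdpo, 4))

# GRPO assigns (-0.7071, 0.7071) to both cases, while GDPO keeps them apart:
# case A -> (-1.4142, 1.4142), case B -> (-0.7071, 0.7071).
```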
Experiment
- Evaluated GDPO against GRPO on the tool-calling task (Sec. 4.1): GDPO achieved 2.7% higher average accuracy and over 4% better format correctness on BFCL-v3 compared to GRPO for Qwen2.5-1.5B, with consistent convergence to higher correctness and format rewards across five runs.
- Ablation study on GRPO with/without standard deviation normalization: GRPO w/o std matched correctness gains but failed to converge on format reward, achieving 0% correct format ratio on BFCL-v3, indicating instability in multi-reward optimization.
- Compared GDPO and GRPO on mathematical reasoning (Sec. 4.2): GDPO improved pass@1 accuracy by up to 6.7% on MATH and 3% on AIME, while reducing length-exceeding ratios to 0.1–0.2% (vs. 2.1–2.5% for GRPO), demonstrating superior trade-off between accuracy and efficiency.
- Analyzed reward weighting and the impact of a conditioned reward: reducing the length reward weight alone had limited effect, whereas conditioning the length reward on correctness (see the sketch after this list) significantly improved alignment, with GDPO achieving 4.4% higher accuracy on AIME and 16.9% lower length violations than GRPO.
- Extended to three-reward coding reasoning (Sec. 4.3): GDPO achieved 2.6–3.3% higher pass rates on Codecontests and Taco with minimal increase in length violations, and reduced bug ratio while maintaining pass rate, showing strong generalization to three objectives.
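The conditioned length reward mentioned in the bullets above gates the length bonus on correctness. The paper does not spell out its exact form, so the following is only an illustrative sketch with assumed names (`is_correct`, `response_len`, `max_len`) and a binary bonus:

```python
def conditioned_length_reward(is_correct: bool, response_len: int, max_len: int) -> float:
    """Illustrative conditioned length reward (assumed form, not the paper's definition).

    The length bonus is granted only when the answer is correct, so the policy
    cannot trade correctness away for brevity.
    """
    if not is_correct:
        return 0.0
    return 1.0 if response_len <= max_len else 0.0
```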
The authors compare GDPO and GRPO on a mathematical reasoning task that optimizes accuracy and length constraints, finding that GDPO consistently achieves higher accuracy and better adherence to length limits. While both methods initially prioritize the easier length reward, GDPO recovers and improves correctness more effectively, leading to superior performance across all benchmarks.

The authors compare GDPO and GRPO on a coding reasoning task with three reward objectives: pass rate, length constraint, and bug ratio. Results show that GDPO consistently achieves a better balance across all objectives than GRPO, maintaining similar pass rates while significantly reducing both length-exceeding ratios and bug ratios.

Results show that GDPO consistently outperforms GRPO across all mathematical reasoning benchmarks, achieving higher accuracy and significantly lower length-exceeding ratios. The improvements are particularly notable on challenging tasks like AIME and Olympiad, where GDPO achieves up to 3% higher accuracy while reducing length violations to near zero.

The authors use GDPO to optimize two rewards—tool-calling correctness and format compliance—on the tool-calling task, and results show that GDPO consistently achieves higher correctness and format rewards compared to GRPO across five training runs. GDPO also outperforms GRPO in downstream evaluation on the BFCL-v3 benchmark, improving average tool-calling accuracy and format correctness by up to 5% and 4%, respectively.
