
DenseGRPO: From Sparse to Dense Rewards for Flow Matching Model Alignment

Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang

Abstract

Recently, GRPO (Group Relative Policy Optimization)-based approaches built on flow matching models have shown remarkable progress in aligning text-to-image generation with human preferences. However, these methods still suffer from a sparse reward problem: the terminal reward of the full denoising process is applied uniformly to every intermediate step, creating a mismatch between the global feedback signal and the fine-grained contribution of each intermediate denoising step. To address this issue, we propose DenseGRPO, a novel framework that achieves human preference alignment with fine-grained rewards. Its key feature is the introduction of dense rewards that evaluate the fine-grained contribution of each denoising step. Specifically, it comprises two key components: (1) We define dense rewards by predicting the step-wise reward gain, applying the reward model to the clean image recovered from each intermediate step via an ODE-based approach; this aligns the feedback signal with the actual contribution of each step and enables effective learning. (2) Guided by the estimated dense rewards, we reveal a mismatch in prior GRPO-based methods between the uniform exploration setting and the time-varying noise intensity, which yields an ill-suited exploration space. We therefore propose a reward-aware exploration-space calibration scheme that adaptively adjusts the stochastic noise injected at each timestep of the SDE sampler, maintaining an appropriate exploration space at every timestep. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of DenseGRPO and highlight the importance of accurate dense rewards for aligning flow matching models.

One-sentence Summary

Researchers from Huazhong University of Science and Technology and Tongyi Lab propose DenseGRPO, a framework that uses dense step-wise rewards to align text-to-image generation with human preferences, overcoming the sparse reward issue by calibrating exploration via adaptive stochasticity and significantly improving flow matching model performance.

Key Contributions

  • DenseGRPO addresses the sparse reward problem in GRPO-based text-to-image models by introducing step-wise dense rewards estimated via an ODE-based method that evaluates intermediate clean images, aligning feedback with each denoising step’s contribution.
  • It reveals and corrects a mismatch between uniform exploration and time-varying noise intensity by proposing a reward-aware SDE sampler that adaptively adjusts stochasticity per timestep, ensuring balanced exploration across the denoising trajectory.
  • Experiments on multiple benchmarks confirm DenseGRPO’s state-of-the-art performance, validating the necessity of dense rewards for effective human preference alignment in flow matching models.

Introduction

The authors leverage flow matching models for text-to-image generation and address the persistent challenge of aligning them with human preferences using reinforcement learning. Prior GRPO-based methods suffer from sparse rewards: a single terminal reward is applied to all denoising steps, which misaligns feedback with each step's actual contribution and hinders fine-grained optimization. DenseGRPO introduces dense rewards by estimating step-wise reward gains via ODE-based evaluation of intermediate clean images, ensuring feedback matches individual step contributions. It further calibrates exploration by adaptively adjusting the stochasticity of the SDE sampler at each timestep, correcting the mismatch between uniform exploration and time-varying noise intensity. Experiments confirm DenseGRPO's state-of-the-art performance, validating the necessity of dense, step-aware rewards for effective alignment.

Dataset

  • The authors use only publicly available datasets, all compliant with their respective licenses, ensuring ethical adherence per the ICLR Code of Ethics.
  • No human subjects, sensitive personal data, or proprietary content are involved in the study.
  • The methods introduced carry no foreseeable risk of misuse or harm.
  • Dataset composition, subset details, training splits, mixture ratios, cropping strategies, or metadata construction are not described in the provided text.

Method

The authors reformulate the denoising process of flow matching models as a Markov Decision Process (MDP) to enable reinforcement learning-based alignment. In this formulation, the state at timestep $t$ is defined as $\mathbf{s}_t \triangleq (\pmb{c}, t, \pmb{x}_t)$, where $\pmb{c}$ is the prompt, $t$ is the current timestep, and $\pmb{x}_t$ is the latent representation. The action corresponds to the predicted previous latent $\pmb{x}_{t-1}$, and the policy $\pi(\mathbf{a}_t \mid \mathbf{s}_t)$ is modeled as the conditional distribution $p(\pmb{x}_{t-1} \mid \pmb{x}_t, \pmb{c})$. The reward is sparse and trajectory-wise, assigned only at the terminal state $t = 0$ as $\mathcal{R}(\pmb{x}_0, \pmb{c})$, while intermediate steps receive zero reward. This design leads to a mismatch: the same terminal reward is used to optimize all timesteps, ignoring the distinct contributions of individual denoising steps.
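
As a minimal sketch of this MDP view (the class and function names below are ours, not the paper's), each denoising step can be treated as one transition:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class DenoisingState:
    """State s_t = (c, t, x_t) of the denoising MDP."""
    prompt: Any      # conditioning prompt c
    timestep: int    # current timestep t
    latent: Any      # current latent x_t

def transition(state: DenoisingState, x_prev: Any) -> DenoisingState:
    """One MDP step: the action is the predicted previous latent x_{t-1},
    sampled from the policy pi(a_t | s_t) = p(x_{t-1} | x_t, c)."""
    return DenoisingState(state.prompt, state.timestep - 1, x_prev)
```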

To resolve this, DenseGRPO introduces a step-wise dense reward mechanism. Instead of relying on a single terminal reward, the method estimates a reward $R_t^i$ for each intermediate latent $\pmb{x}_t^i$ along the trajectory. This is achieved by leveraging the deterministic nature of the ODE denoising process: given $\pmb{x}_t^i$, the model can deterministically generate the corresponding clean latent $\hat{\pmb{x}}_{t,0}^i$ via $n$-step ODE denoising, i.e., $\hat{\pmb{x}}_{t,0}^i = \mathrm{ODE}_n(\pmb{x}_t^i, \pmb{c})$. The reward for $\pmb{x}_t^i$ is then assigned as $R_t^i \triangleq \mathcal{R}(\hat{\pmb{x}}_{t,0}^i, \pmb{c})$, where $\mathcal{R}$ is a pre-trained reward model. The step-wise dense reward $\Delta R_t^i$ is defined as the reward gain between consecutive steps: $\Delta R_t^i = R_{t-1}^i - R_t^i$. This formulation provides a fine-grained, step-specific feedback signal that reflects the actual contribution of each denoising action.
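
A sketch of this dense reward estimation is given below, under stated assumptions: `ode_denoise` stands in for deterministic n-step ODE denoising and `reward_model` for the pre-trained reward model; neither name refers to a specific API.

```python
def dense_rewards(latents, prompt, ode_denoise, reward_model, n_steps=10):
    """Estimate step-wise dense rewards for one sampled trajectory.

    latents      -- intermediate latents [x_T, ..., x_1, x_0] from the SDE rollout
    ode_denoise  -- callable (x_t, prompt, n_steps) -> clean latent x_hat_{t,0}
    reward_model -- callable (clean_latent, prompt) -> scalar reward R
    Returns one dense reward Delta R_t = R_{t-1} - R_t per denoising step.
    """
    # Score each intermediate latent by deterministically completing the
    # trajectory with n-step ODE denoising and evaluating the clean result.
    step_rewards = [reward_model(ode_denoise(x_t, prompt, n_steps), prompt)
                    for x_t in latents]

    # The dense reward of the step x_t -> x_{t-1} is the reward gain between
    # consecutive intermediate latents (later entries sit at smaller t).
    return [r_prev - r_curr
            for r_curr, r_prev in zip(step_rewards[:-1], step_rewards[1:])]
```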

Refer to the framework diagram illustrating the dense reward estimation process. The diagram shows how, for each latent $\pmb{x}_t^i$, an ODE denoising trajectory is computed to obtain the clean counterpart $\hat{\pmb{x}}_{t,0}^i$, which is then evaluated by the reward model to yield $R_t^i$. The dense reward $\Delta R_t^i$ is derived from the difference between successive latent rewards, enabling per-timestep credit assignment.

In the GRPO training loop, the dense reward $\Delta R_t^i$ replaces the sparse terminal reward in the advantage computation. The advantage for the $i$-th trajectory at timestep $t$ is recalculated as:

$$
\hat{A}_t^i = \frac{\Delta R_t^i - \mathrm{mean}\left(\{\Delta R_t^i\}_{i=1}^G\right)}{\mathrm{std}\left(\{\Delta R_t^i\}_{i=1}^G\right)}.
$$

This ensures that the policy update at each timestep is guided by a reward signal that reflects the immediate contribution of that step, rather than the global trajectory outcome.
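This advantage can be computed directly from an array of dense rewards. The sketch below assumes the rewards for $G$ trajectories of the same prompt are arranged as a $(G, T)$ matrix; the small `eps` term for numerical stability is our addition, not from the paper.

```python
import numpy as np

def group_advantages(dense_rewards, eps=1e-8):
    """Per-timestep, group-relative advantages from dense rewards.

    dense_rewards -- array of shape (G, T): Delta R_t^i for G trajectories, T steps.
    Returns advantages A_hat_t^i of the same shape.
    """
    r = np.asarray(dense_rewards, dtype=np.float64)
    mean = r.mean(axis=0, keepdims=True)   # mean over the group, per timestep
    std = r.std(axis=0, keepdims=True)     # std over the group, per timestep
    return (r - mean) / (std + eps)
```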

To further enhance exploration during training, DenseGRPO introduces a reward-aware calibration of the SDE sampler's noise injection. While existing methods use a uniform noise level $a$ across all timesteps, DenseGRPO adapts the noise intensity $\psi(t)$ per timestep to maintain a balanced exploration space. The calibration is performed iteratively: for each timestep $t$, the algorithm samples trajectories, computes dense rewards $\Delta R_t^i$, and adjusts $\psi(t)$ based on the balance between positive and negative rewards. If the numbers of positive and negative rewards are approximately equal, $\psi(t)$ is increased to encourage diversity; otherwise, it is decreased to restore balance. The calibrated noise schedule $\psi(t)$ is then used in the SDE sampler, replacing the original $\sigma_t = a\sqrt{t/(1-t)}$ with $\sigma_t = \psi(t)$, thereby tailoring the stochasticity to the time-varying nature of the denoising process.
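
The calibration loop can be sketched as follows, under stated assumptions: the balance tolerance and the multiplicative update factors are illustrative hyperparameters of ours, not values from the paper. The returned schedule is then plugged into the SDE sampler as $\sigma_t = \psi(t)$.

```python
import numpy as np

def calibrate_noise_schedule(psi, dense_rewards, balance_tol=0.1,
                             step_up=1.1, step_down=0.9):
    """One reward-aware calibration pass over the noise schedule psi(t).

    psi           -- array of shape (T,): current noise level per timestep.
    dense_rewards -- array of shape (G, T): Delta R_t^i from freshly sampled trajectories.
    Returns the updated schedule, used as sigma_t = psi(t) in the SDE sampler.
    """
    psi = np.asarray(psi, dtype=np.float64).copy()
    r = np.asarray(dense_rewards, dtype=np.float64)
    n_traj = r.shape[0]
    for t in range(r.shape[1]):
        n_pos = int((r[:, t] > 0).sum())
        n_neg = int((r[:, t] < 0).sum())
        if abs(n_pos - n_neg) <= balance_tol * n_traj:
            psi[t] *= step_up    # balanced rewards: raise psi(t) to encourage diversity
        else:
            psi[t] *= step_down  # imbalanced rewards: lower psi(t) to restore balance
    return psi
```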

Experiment

  • DenseGRPO outperforms Flow-GRPO and Flow-GRPO+CoCA across compositional image generation, human preference alignment, and visual text rendering, demonstrating superior alignment with target preferences through step-wise dense rewards.
  • Ablation studies confirm that step-wise dense rewards significantly improve policy optimization over sparse trajectory rewards, and time-specific noise calibration enhances exploration effectiveness.
  • Increasing ODE denoising steps improves reward accuracy and model performance, despite higher computational cost, validating that precise reward estimation is critical for alignment.
  • DenseGRPO generalizes beyond flow matching models to diffusion models and higher resolutions, maintaining performance gains via deterministic sampling for accurate latent reward prediction.
  • While achieving strong alignment, DenseGRPO shows slight susceptibility to reward hacking in specific tasks, suggesting a trade-off between reward precision and robustness that may be mitigated with higher-quality reward models.

The authors use DenseGRPO to introduce step-wise dense rewards in text-to-image generation, achieving consistent improvements over Flow-GRPO and Flow-GRPO+CoCA across compositional, text-rendering, and human preference tasks. Results show that dense reward signals better align feedback with individual denoising steps, enhancing both semantic accuracy and visual quality, while ablation studies confirm the importance of calibrated exploration and multi-step ODE denoising for reward accuracy. Although some reward hacking occurs under specific metrics, DenseGRPO demonstrates robustness and generalizability across model architectures and resolutions.

