4달 전

Haoyou Deng Keyu Yan Chaojie Mao Xiang Wang Yu Liu Changxin Gao Nong Sang

초록

최근 흐름 매칭 모델 기반의 GRPO 접근법은 텍스트-to-이미지 생성에서 인간 선호도와의 일치도에 놀라운 개선을 보여주었다. 그러나 이러한 방법들은 여전히 희소 보상 문제(sparse reward problem)에 직면해 있다. 즉, 전체 노이즈 제거 경로의 종단 보상이 모든 중간 단계에 동일하게 적용되면서, 전반적인 피드백 신호와 중간 노이즈 제거 단계에서의 세밀한 기여도 간에 불일치가 발생한다. 이 문제를 해결하기 위해, 우리는 각 노이즈 제거 단계의 세밀한 기여도를 평가할 수 있는 밀도 높은 보상(dense reward)을 활용하여 인간 선호도를 정렬하는 새로운 프레임워크인 DenseGRPO를 제안한다. 구체적으로 본 연구는 두 가지 핵심 구성 요소를 포함한다. (1) 각 노이즈 제거 단계에 대한 단계별 보상 증가량(step-wise reward gain)을 밀도 높은 보상으로 예측하는 방법을 제안한다. 이는 중간 단계의 클린 이미지에 대해 ODE 기반 접근법을 활용해 보상 모델을 적용함으로써 구현되며, 피드백 신호와 각 단계의 기여도 간의 일치를 보장함으로써 효과적인 학습을 가능하게 한다. (2) 추정된 밀도 높은 보상 기반으로, 기존 GRPO 기반 방법에서 보이는 균일한 탐색 설정(uniform exploration setting)과 시간에 따라 변화하는 노이즈 강도 사이의 불일치가 드러나며, 이는 부적절한 탐색 공간을 초래한다. 이를 해결하기 위해, SDE 샘플러에서 타임스텝별로 적응적으로 확률적 노이즈 주입(stochasticity injection)을 조정함으로써 탐색 공간을 보상 인식 보정(reward-aware calibration)하는 새로운 전략을 제안한다. 이로써 모든 타임스텝에서 적절한 탐색 공간을 유지할 수 있다. 다양한 표준 벤치마크에서 수행된 광범위한 실험을 통해 DenseGRPO의 효과성을 입증하였으며, 특히 흐름 매칭 모델의 정렬 과정에서 유효한 밀도 높은 보상의 결정적 역할을 강조한다.

One-sentence Summary

Researchers from Huazhong University of Science and Technology and Tongyi Lab propose DenseGRPO, a framework using dense step-wise rewards to align human preferences in text-to-image generation, overcoming sparse reward issues by calibrating exploration via adaptive stochasticity, significantly improving flow matching model performance.

Key Contributions

DenseGRPO addresses the sparse reward problem in GRPO-based text-to-image models by introducing step-wise dense rewards estimated via an ODE-based method that evaluates intermediate clean images, aligning feedback with each denoising step’s contribution.
It reveals and corrects a mismatch between uniform exploration and time-varying noise intensity by proposing a reward-aware SDE sampler that adaptively adjusts stochasticity per timestep, ensuring balanced exploration across the denoising trajectory.
Experiments on multiple benchmarks confirm DenseGRPO’s state-of-the-art performance, validating the necessity of dense rewards for effective human preference alignment in flow matching models.

Introduction

The authors leverage flow matching models for text-to-image generation and address the persistent challenge of aligning them with human preferences using reinforcement learning. Prior GRPO-based methods suffer from sparse rewards—applying a single terminal reward to all denoising steps—which misaligns feedback with each step’s actual contribution, hindering fine-grained optimization. DenseGRPO introduces dense rewards by estimating step-wise reward gains via ODE-based intermediate image evaluation, ensuring feedback matches individual step contributions. It further calibrates exploration by adaptively adjusting stochasticity in the SDE sampler per timestep, correcting imbalance in reward distribution. Experiments confirm DenseGRPO’s state-of-the-art performance, validating the necessity of dense, step-aware rewards for effective alignment.

Dataset

The authors use only publicly available datasets, all compliant with their respective licenses, ensuring ethical adherence per the ICLR Code of Ethics.
No human subjects, sensitive personal data, or proprietary content are involved in the study.
The methods introduced carry no foreseeable risk of misuse or harm.
Dataset composition, subset details, training splits, mixture ratios, cropping strategies, or metadata construction are not described in the provided text.

Method

The authors leverage a reformulation of the denoising process in flow matching models as a Markov Decision Process (MDP) to enable reinforcement learning-based alignment. In this formulation, the state at timestep $t$ is defined as $\mathbf{s}_t \triangleq (\pmb{c}, t, \pmb{x}_t)$ , where $\pmb{c}$ is the prompt, $t$ is the current timestep, and $\pmb{x}_t$ is the latent representation. The action corresponds to the predicted prior latent $\pmb{x}_{t-1}$ , and the policy $\pi(\mathbf{a}_t \mid \mathbf{s}_t)$ is modeled as the conditional distribution $p(\pmb{x}_{t-1} \mid \pmb{x}_t, \pmb{c})$ . The reward is sparse and trajectory-wise, assigned only at the terminal state $t=0$ as $\mathcal{R}(\pmb{x}_0, \pmb{c})$ , while intermediate steps receive zero reward. This design leads to a mismatch: the same terminal reward is used to optimize all timesteps, ignoring the distinct contributions of individual denoising steps.

To resolve this, DenseGRPO introduces a step-wise dense reward mechanism. Instead of relying on a single terminal reward, the method estimates a reward $R_t^i$ for each intermediate latent $\pmb{x}_t^i$ along the trajectory. This is achieved by leveraging the deterministic nature of the ODE denoising process: given $\pmb{x}_t^i$ , the model can deterministically generate the corresponding clean latent $\hat{\pmb{x}}_{t,0}^i$ via $n$ -step ODE denoising, i.e., $\hat{\pmb{x}}_{t,0}^i = \mathrm{ODE}_n(\pmb{x}_t^i, \pmb{c})$ . The reward for $\pmb{x}_t^i$ is then assigned as $R_t^i \triangleq \mathcal{R}(\hat{\pmb{x}}_{t,0}^i, \pmb{c})$ , where $\mathcal{R}$ is a pre-trained reward model. The step-wise dense reward $\Delta R_t^i$ is defined as the reward gain between consecutive steps: $\Delta R_t^i = R_{t-1}^i - R_t^i$ . This formulation provides a fine-grained, step-specific feedback signal that reflects the actual contribution of each denoising action.

Refer to the framework diagram illustrating the dense reward estimation process. The diagram shows how, for each latent $\pmb{x}_t^i$ , an ODE denoising trajectory is computed to obtain the clean counterpart $\hat{\pmb{x}}_{t,0}^i$ , which is then evaluated by the reward model to yield $R_t^i$ . The dense reward $\Delta R_t^i$ is derived from the difference between successive latent rewards, enabling per-timestep credit assignment.

In the GRPO training loop, the dense reward $\Delta R_t^i$ replaces the sparse terminal reward in the advantage computation. The advantage for the $i$ -th trajectory at timestep $t$ is recalculated as:

\hat{A}_t^i = \frac{\Delta R_t^i - \mathrm{mean}(\{\Delta R_t^i\}_{i=1}^G)}{\mathrm{std}(\{\Delta R_t^i\}_{i=1}^G)}.

This ensures that the policy update at each timestep is guided by a reward signal that reflects the immediate contribution of that step, rather than the global trajectory outcome.

To further enhance exploration during training, DenseGRPO introduces a reward-aware calibration of the SDE sampler’s noise injection. While existing methods use a uniform noise level $a$ across all timesteps, DenseGRPO adapts the noise intensity $\psi(t)$ per timestep to maintain a balanced exploration space. The calibration is performed iteratively: for each timestep $t$ , the algorithm samples trajectories, computes dense rewards $\Delta R_t^i$ , and adjusts $\psi(t)$ based on the balance between positive and negative rewards. If the number of positive and negative rewards is approximately equal, $\psi(t)$ is increased to encourage diversity; otherwise, it is decreased to restore balance. The calibrated noise schedule $\psi(t)$ is then used in the SDE sampler, replacing the original $\sigma_t = a\sqrt{t/(1-t)}$ with $\sigma_t = \psi(t)$ , thereby tailoring the stochasticity to the time-varying nature of the denoising process.

Experiment

DenseGRPO outperforms Flow-GRPO and Flow-GRPO+CoCA across compositional image generation, human preference alignment, and visual text rendering, demonstrating superior alignment with target preferences through step-wise dense rewards.
Ablation studies confirm that step-wise dense rewards significantly improve policy optimization over sparse trajectory rewards, and time-specific noise calibration enhances exploration effectiveness.
Increasing ODE denoising steps improves reward accuracy and model performance, despite higher computational cost, validating that precise reward estimation is critical for alignment.
DenseGRPO generalizes beyond flow matching models to diffusion models and higher resolutions, maintaining performance gains via deterministic sampling for accurate latent reward prediction.
While achieving strong alignment, DenseGRPO shows slight susceptibility to reward hacking in specific tasks, suggesting a trade-off between reward precision and robustness that may be mitigated with higher-quality reward models.

The authors use DenseGRPO to introduce step-wise dense rewards in text-to-image generation, achieving consistent improvements over Flow-GRPO and Flow-GRPO+CoCA across compositional, text-rendering, and human preference tasks. Results show that dense reward signals better align feedback with individual denoising steps, enhancing both semantic accuracy and visual quality, while ablation studies confirm the importance of calibrated exploration and multi-step ODE denoising for reward accuracy. Although some reward hacking occurs under specific metrics, DenseGRPO demonstrates robustness and generalizability across model architectures and resolutions.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

4달 전

Haoyou Deng Keyu Yan Chaojie Mao Xiang Wang Yu Liu Changxin Gao Nong Sang

초록

One-sentence Summary

Key Contributions

DenseGRPO addresses the sparse reward problem in GRPO-based text-to-image models by introducing step-wise dense rewards estimated via an ODE-based method that evaluates intermediate clean images, aligning feedback with each denoising step’s contribution.
It reveals and corrects a mismatch between uniform exploration and time-varying noise intensity by proposing a reward-aware SDE sampler that adaptively adjusts stochasticity per timestep, ensuring balanced exploration across the denoising trajectory.
Experiments on multiple benchmarks confirm DenseGRPO’s state-of-the-art performance, validating the necessity of dense rewards for effective human preference alignment in flow matching models.

Introduction

Dataset

The authors use only publicly available datasets, all compliant with their respective licenses, ensuring ethical adherence per the ICLR Code of Ethics.
No human subjects, sensitive personal data, or proprietary content are involved in the study.
The methods introduced carry no foreseeable risk of misuse or harm.
Dataset composition, subset details, training splits, mixture ratios, cropping strategies, or metadata construction are not described in the provided text.

Method

\hat{A}_t^i = \frac{\Delta R_t^i - \mathrm{mean}(\{\Delta R_t^i\}_{i=1}^G)}{\mathrm{std}(\{\Delta R_t^i\}_{i=1}^G)}.

This ensures that the policy update at each timestep is guided by a reward signal that reflects the immediate contribution of that step, rather than the global trajectory outcome.

Experiment

DenseGRPO outperforms Flow-GRPO and Flow-GRPO+CoCA across compositional image generation, human preference alignment, and visual text rendering, demonstrating superior alignment with target preferences through step-wise dense rewards.
Ablation studies confirm that step-wise dense rewards significantly improve policy optimization over sparse trajectory rewards, and time-specific noise calibration enhances exploration effectiveness.
Increasing ODE denoising steps improves reward accuracy and model performance, despite higher computational cost, validating that precise reward estimation is critical for alignment.
DenseGRPO generalizes beyond flow matching models to diffusion models and higher resolutions, maintaining performance gains via deterministic sampling for accurate latent reward prediction.
While achieving strong alignment, DenseGRPO shows slight susceptibility to reward hacking in specific tasks, suggesting a trade-off between reward precision and robustness that may be mitigated with higher-quality reward models.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

DenseGRPO: 희소 보상에서 밀도 보상으로의 전환을 통한 흐름 매칭 모델 정렬

Haoyou Deng Keyu Yan Chaojie Mao Xiang Wang Yu Liu Changxin Gao Nong Sang

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

DenseGRPO: 희소 보상에서 밀도 보상으로의 전환을 통한 흐름 매칭 모델 정렬

Haoyou Deng Keyu Yan Chaojie Mao Xiang Wang Yu Liu Changxin Gao Nong Sang

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

DenseGRPO: 희소 보상에서 밀도 보상으로의 전환을 통한 흐름 매칭 모델 정렬

Haoyou Deng Keyu Yan Chaojie Mao Xiang Wang Yu Liu Changxin Gao Nong Sang

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters