
Mitigating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Abstract

Applying GRPO to flow matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, the current group-wise ranking mainly compares trajectories at corresponding timesteps while ignoring within-trajectory dependencies, where some early denoising actions can influence later states through implicit, delayed interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that mitigates step-level reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO introduces two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates the "pure" effect of each denoising action; and (ii) it identifies turning points, steps that flip the local reward trend and make the subsequent reward evolution consistent with the global trajectory trend, and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely from sign changes in the incremental rewards, making TP-GRPO both efficient and hyperparameter-free. Extensive experiments further demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation quality. Demo code is available at: https://github.com/YunzeTong/TurningPoint-GRPO.

One-sentence Summary

Researchers from institutions including Tsinghua and Alibaba propose TP-GRPO, a GRPO variant that replaces sparse outcome rewards with dense step-level signals and identifies turning points to capture delayed effects, improving text-to-image generation by better modeling denoising trajectory dynamics without hyperparameters.

Key Contributions

  • TP-GRPO replaces sparse outcome-based rewards with step-level incremental rewards to isolate the pure effect of each denoising action, reducing reward sparsity and improving credit assignment during RL fine-tuning of Flow Matching models.
  • It introduces turning points—steps identified by sign changes in incremental rewards—that flip local reward trends and are assigned aggregated long-term rewards to explicitly model delayed, within-trajectory dependencies critical for coherent image generation.
  • Evaluated on standard text-to-image benchmarks, TP-GRPO consistently outperforms prior GRPO methods like Flow-GRPO and DanceGRPO by better exploiting reward signals, with no added hyperparameters and efficient implementation.

Introduction

The authors leverage flow-based generative models and reinforcement learning to address sparse, misaligned rewards in text-to-image generation. Prior methods like Flow-GRPO assign the final image reward uniformly across all denoising steps, ignoring step-specific contributions and within-trajectory dynamics — leading to reward sparsity and local-global misalignment. To fix this, they introduce TurningPoint-GRPO, which computes step-wise rewards via incremental reward differences and explicitly models long-term effects of critical “turning point” steps that reverse local reward trends, enabling more accurate credit assignment without extra hyperparameters.

Method

The authors leverage a modified GRPO framework, TurningPoint-GRPO (TP-GRPO), to address reward sparsity and misalignment in flow matching-based text-to-image generation. The core innovation lies in replacing outcome-based rewards with step-level incremental rewards and explicitly modeling long-term effects via turning points—steps that flip the local reward trend to align with the global trajectory trend. This design enables more precise credit assignment across the denoising trajectory.

The method begins by sampling diverse trajectories using an SDE-based sampler, which injects stochasticity into the reverse-time denoising process. For each trajectory, intermediate latents are cached, and ODE sampling is applied from each latent to completion to obtain corresponding clean images. This allows the reward model to evaluate the cumulative effect of all preceding SDE steps. The step-wise reward $r_t$ for the transition from $x_t$ to $x_{t-1}$ is then computed as the difference in reward between the ODE-completed images: $r_t = R\big(x_{t-1}^{\text{ODE}(t-1)}\big) - R\big(x_t^{\text{ODE}(t)}\big)$. This incremental reward isolates the "pure" effect of each denoising action, providing a dense, step-aware signal that avoids the sparsity inherent in propagating a single terminal reward.
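A minimal sketch of this computation is shown below. Here `ode_complete` and `reward_model` are hypothetical stand-ins for the deterministic ODE completion and the reward model, and the latents are toy tensors rather than real diffusion states.

```python
import torch

def stepwise_incremental_rewards(sde_latents, ode_complete, reward_model):
    """Compute dense step-level rewards r_t from one cached SDE trajectory.

    sde_latents: cached latents [x_T, ..., x_1, x_0] in generation order.
    ode_complete(latent, t): hypothetical deterministic ODE completion mapping
        the latent at timestep t to a clean image x_t^{ODE(t)}.
    reward_model(image): hypothetical scorer returning a scalar reward R(.).
    """
    T = len(sde_latents) - 1
    scores = []
    for i, latent in enumerate(sde_latents):
        t = T - i                           # timestep counted down to 0
        clean = ode_complete(latent, t)     # x_t^{ODE(t)}
        scores.append(reward_model(clean))  # R(x_t^{ODE(t)})
    # r_t = R(x_{t-1}^{ODE(t-1)}) - R(x_t^{ODE(t)}): the "pure" gain of one SDE step.
    return [scores[i + 1] - scores[i] for i in range(T)]

# Toy usage with stand-in components (illustration only).
latents = [torch.randn(4, 8, 8) for _ in range(6)]      # x_5, ..., x_0
increments = stepwise_incremental_rewards(
    latents,
    ode_complete=lambda x, t: x,                         # placeholder ODE completion
    reward_model=lambda img: img.mean().item(),          # placeholder reward model
)
print(increments)                                        # five step-level rewards
```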

Turning points are identified based on sign changes in these incremental rewards. A timestep $t$ qualifies as a turning point if the local reward trend flips to become consistent with the overall trajectory trend. Specifically, the authors define a turning point using the sign consistency between the local step gain and the cumulative gain from that step to the end. As illustrated in the paper's figure, turning points are visually characterized by a local reversal in reward direction that aligns with the global trajectory trend, distinguishing them from normal points that either do not flip or misalign with the global direction.
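The following sketch gives one plausible reading of this detection rule, using only the incremental rewards; the function name and the toy sequence are illustrative, not taken from the released code.

```python
import numpy as np

def find_turning_points(increments):
    """Flag steps whose local gain flips sign relative to the previous step and
    whose new sign agrees with the cumulative gain from that step to the end."""
    r = np.asarray(increments, dtype=float)
    # tail[i] = sum of r[i:], which telescopes to R(x_0) - R(x_t^{ODE(t)}).
    tail = np.cumsum(r[::-1])[::-1]
    points = []
    for i in range(1, len(r)):
        flips = r[i] != 0.0 and np.sign(r[i]) != np.sign(r[i - 1])
        aligned = np.sign(r[i]) == np.sign(tail[i])
        if flips and aligned:
            points.append(i)
    return points

print(find_turning_points([0.3, -0.1, 0.4, 0.2, -0.05]))  # -> [2, 4] for this toy sequence
```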

At identified turning points, the local step reward $r_t$ is replaced with an aggregated reward $r_t^{\text{agg}} = R(x_0) - R\big(x_t^{\text{ODE}(t)}\big)$, which captures the cumulative effect from the turning point to the final image. This aggregated reward encodes the delayed, implicit impact of the denoising action on subsequent steps. The authors further refine this by introducing a stricter criterion, consistent turning points, which requires the aggregated reward to have a larger absolute value than the local reward, ensuring that only steps with significant long-term influence are selected.
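Continuing the sketch above, a hypothetical helper might swap in the aggregated reward at detected turning points and apply the stricter consistency check. Note that the tail sum of incremental rewards telescopes to $R(x_0) - R\big(x_t^{\text{ODE}(t)}\big)$, so no extra reward-model calls are needed.

```python
def assign_step_rewards(increments, turning_points, strict=True):
    """Replace local rewards with aggregated ones at turning points.

    increments: step rewards r_t in generation order.
    turning_points: indices from a detector such as find_turning_points above.
    strict: keep only "consistent" turning points, i.e. those whose aggregated
        reward exceeds the local reward in absolute value (sketch of the
        stricter criterion described in the text).
    """
    rewards = list(increments)
    for i in turning_points:
        r_agg = sum(increments[i:])                    # R(x_0) - R(x_t^{ODE(t)})
        if strict and abs(r_agg) <= abs(increments[i]):
            continue                                   # long-term effect too weak; keep r_t
        rewards[i] = r_agg
    return rewards
```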

To address the exclusion of the initial denoising step from turning point detection, the authors extend the framework via Remark 5.2. The first step is eligible for aggregated reward assignment if its local reward change aligns with the overall trajectory trend. This ensures that early, influential decisions are also modeled for their long-term impact.
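A small illustrative check for this first-step rule (an assumed reading; the helper name is hypothetical):

```python
def first_step_gets_aggregate(increments):
    """Remark-5.2-style check (sketch): the initial denoising step is eligible
    for an aggregated reward when its local gain shares the sign of the whole
    trajectory's cumulative gain."""
    total = sum(increments)                 # overall trajectory trend
    return increments[0] != 0 and (increments[0] > 0) == (total > 0)
```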

The overall training process follows a group-wise ranking scheme. For each timestep, rewards (either $r_t$ or $r_t^{\text{agg}}$) are normalized across a group of trajectories to compute advantages. The policy is then optimized using a clipped objective that includes a KL regularization term to prevent excessive deviation from the reference policy. A balancing strategy is employed to maintain a roughly equal number of positive and negative aggregated rewards in each batch, preventing optimization bias.
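A compact sketch of the optimization step under common GRPO/PPO conventions: per-timestep standardization within a prompt group and a clipped surrogate with a KL penalty. The shapes, `eps`, and `beta` are generic assumptions rather than values reported by the paper, and the batch-balancing of aggregated rewards is omitted.

```python
import torch

def grpo_advantages(step_rewards):
    """step_rewards: tensor of shape (G, T) for a group of G trajectories sharing
    one prompt; advantages are standardized per timestep across the group."""
    mean = step_rewards.mean(dim=0, keepdim=True)
    std = step_rewards.std(dim=0, keepdim=True) + 1e-8
    return (step_rewards - mean) / std

def clipped_objective(logp_new, logp_old, advantages, kl, eps=0.2, beta=0.01):
    """PPO-style clipped surrogate plus a KL penalty toward the reference policy."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    return -surrogate.mean() + beta * kl.mean()
```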

Refer to the framework diagram for a visual summary of the method. The diagram illustrates how step-wise rewards are computed, turning points are identified, and aggregated rewards are assigned to capture long-term effects. It also highlights the inclusion of the initial step via Remark 5.2, ensuring comprehensive modeling of implicit interactions across the entire denoising trajectory.

Experiment

  • TP-GRPO variants outperform Flow-GRPO across compositional image generation, visual text rendering, and human preference alignment, with improved accuracy, aesthetics, and content alignment without reward hacking.
  • Training without KL penalty confirms TP-GRPO’s stronger exploratory capability and faster convergence, especially on non-rule-based rewards like PickScore.
  • Reducing SDE sampling window size moderately (e.g., to 8 steps) improves efficiency and performance, but overly aggressive reduction harms turning-point capture.
  • Noise scale α around 0.7 yields optimal stability; both lower and higher values degrade performance, though TP-GRPO remains robust across settings.
  • Method generalizes to FLUX.1-dev base model, maintaining superior performance over Flow-GRPO under adjusted hyperparameters.
  • Qualitative results show TP-GRPO better handles sparse rule-based rewards, avoids text omissions/overlaps, and produces more semantically coherent and aesthetically aligned outputs.

The authors use TP-GRPO to refine diffusion model training by incorporating step-level rewards and turning-point detection, which consistently improves performance across compositional image generation, visual text rendering, and human preference alignment tasks. Results show that both variants of TP-GRPO outperform Flow-GRPO in task-specific metrics while maintaining or enhancing image quality and preference scores. The method also demonstrates faster convergence and robustness across different base models and hyperparameter settings.

