HyperAIHyperAI

Command Palette

Search for a command to run...

Qwen-Image-2.0-RL 技術報告

概要

本論文では、Qwen-Image-2.0拡散モデルの視覚品質と指示追従能力を向上させるために、人間フィードバックからの強化学習(RLHF)とオンポリシー蒸留(OPD)を適用するポストトレーニングパイプラインであるQwen-Image-2.0-RLを紹介する。信頼性の高い報酬信号を提供するために、ポイントワイズスコアリングパラダイムとチェーンオブソート推論を用いて視覚言語モデルをファインチューニングすることで、タスク固有の複合報酬モデルを構築する。テキストから画像への生成では、報酬モデルは整合性、美的品質、ポートレート忠実度の次元をカバーする。画像編集タスクでは、報酬システムは指示追従の正確さと顔の同一性保持に対処する。この報酬システムに基づき、スケーラブルなGRPOベースのRL訓練フレームワークを開発し、事前学習知識を保持するためのハイブリッド分類器フリーガイダンス(CFG)戦略、グループ内報酬範囲フィルタリングによるプロンプトキュレーション、カテゴリごとの報酬重み較正を組み込んでいる。T2Iと編集のためのタスク特化RLポリシーを統合するために、最終訓練段階としてオンポリシー蒸留を提案し、軌跡レベルの速度マッチングを通じて複数の教師モデルを単一の生徒モデルに統合する。広範な評価により、Qwen-Image-2.0-RLはQwen-Image-Benchで全体スコア57.84(ベースモデル比+2.61)、テキストから画像へのアリーナでEloレーティング1193(+78)、画像編集アリーナで1349(+93)を達成し、美的品質、プロンプト忠実度、編集精度において一貫した向上を示している。

One-sentence Summary

The authors propose Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) with composite reward models fine-tuned via pointwise scoring and chain-of-thought reasoning and on-policy distillation through trajectory-level velocity matching, unifying text-to-image and image editing improvements and achieving a +2.61+2.61+2.61 overall score on Qwen-Image-Bench and Elo ratings of 119311931193 for text-to-image and 134913491349 for image editing.

Key Contributions

  • A VLM-based composite reward system is constructed for both text-to-image and image editing tasks by fine-tuning vision-language models with pointwise scoring and chain-of-thought reasoning, covering aesthetic quality, prompt adherence, portrait fidelity, instruction-following, and visual consistency.
  • A scalable GRPO-based RL training framework is developed with a hybrid CFG strategy that applies guidance during rollout sampling only, prompt curation via intra-group reward range filtering, and per-category reward weight calibration to stabilize large-scale flow matching model training.
  • On-policy distillation is proposed to unify task-specialized RL teachers into a single deployment model through trajectory-level velocity matching, eliminating reward model dependency and cross-task conflicts; the resulting Qwen-Image-2.0-RL surpasses a mixed RL baseline and achieves a 57.84 overall score on Qwen-Image-Bench (+2.61) with Elo ratings of 1193 in the T2I arena (+78) and 1349 in the image edit arena (+93).

Introduction

The authors tackle the gap between diffusion models’ impressive generation capabilities and human aesthetic preferences, a gap left by supervised denoising objectives that do not capture nuanced qualities like composition, texture, or prompt faithfulness. While reinforcement learning from human feedback (RLHF) offers a path to alignment, applying it to diffusion models presents three key challenges: designing reliable, task‑aware reward signals that span text‑to‑image and image editing tasks; scaling RLHF to full‑parameter training with multiple rewards; and consolidating task‑specialized policies into a single model without quality loss. The authors address these through a unified post‑training pipeline built on a chain‑of‑thought‑enabled VLM reward suite, a scalable GRPO‑based RL framework with hybrid classifier‑free guidance and prompt‑reward balancing, and an on‑policy distillation method that merges specialized teachers via trajectory‑level velocity matching, eliminating reward model dependency at deployment.

Dataset

The authors construct two annotated datasets to compare reward model training paradigms, deliberately sharing the same evaluation focus (aesthetic quality and visual texture) while differing only in annotation format. Both datasets draw images from a common pool of state-of-the-art AIGC models, generated from prompts that originate from high-quality portrait reference images and are linguistically diversified to go beyond the training distribution.

Dataset composition and sources

  • All images come from a single pool of AIGC-generated outputs, ensuring the comparison isolates the effect of the training paradigm rather than data distribution.
  • Prompts are randomly sampled from high-quality portrait reference images and rewritten to increase linguistic variety.
  • No explicit dataset sizes are reported in this excerpt; the focus is on annotation structure and training use.

Pointwise subset

  • Each image is independently scored by human annotators on a 5-point Likert scale along two dimensions: Quality (clarity, lighting, color balance, stylistic coherence, material texture) and Fidelity (structural correctness, physical consistency, absence of AI artifacts).
  • The resulting dataset provides absolute scores per image, used to train a reward model with a pointwise regression loss (the model outputs a discrete score distribution and the reward is the expectation).

Pairwise subset

  • Image pairs are generated from the same prompt and presented side by side for preference judgment.
  • Annotation follows a strict priority hierarchy: image-text consistency > structural distortion > texture quality > aesthetic appeal. When both images are free of distortions and consistent with the prompt, the comparison effectively reduces to texture quality and aesthetic appeal.
  • Each pair is further labeled with auxiliary attributes: sample validity, presence of text distortion, and presence of human figure distortion, supporting fine-grained analysis.
  • The resulting dataset provides preference pairs, used to train a reward model with the Bradley-Terry pairwise ranking loss.

How the data is used

  • Both subsets are used to fine-tune the same VLM architecture as reward models, one with pairwise loss and the other with pointwise regression.
  • The pointwise-trained reward model consistently yields better visual quality and fewer artifacts in downstream RL-based image generation, leading the authors to adopt the pointwise paradigm for all VLM-based reward models in the final system.

Processing details

  • No cropping strategy or additional image preprocessing is mentioned.
  • Metadata construction: pointwise data stores per-image Quality and Fidelity scores; pairwise data stores preference labels and the auxiliary attributes listed above.

Method

The authors leverage a comprehensive reinforcement learning framework to align diffusion models with human preferences, centered around a robust reward modeling system and a specialized training pipeline.

Reward Modeling To capture human preferences, the authors construct task-specific composite reward models. They investigate two training paradigms for fine-tuning Vision-Language Models (VLMs) as reward scorers: pairwise ranking and pointwise regression. In the pairwise approach, the model minimizes a Bradley-Terry ranking loss on preferred versus non-preferred image pairs. Conversely, the pointwise paradigm trains the model to regress directly to absolute human-annotated scores. The authors determine that pointwise training provides a richer supervisory signal, resulting in generated images with consistently better visual quality, finer texture detail, and fewer artifacts compared to the pairwise approach.

Training Pipeline and Hybrid CFG Strategy The training process extends Group Relative Policy Optimization (GRPO) to flow matching models. A critical architectural decision involves the application of Classifier-Free Guidance (CFG). The authors systematically evaluate three strategies: applying CFG in both rollout and training, removing it from both stages, or employing a hybrid approach. Applying CFG during training leads to severe instability and image collapse, while omitting it entirely causes the model to lose stylization capabilities and world knowledge. Consequently, the authors adopt a hybrid strategy where CFG is utilized exclusively during the rollout phase to generate high-quality candidates for reward evaluation, while the policy optimization objective remains CFG-free to maintain stable gradient updates.

The pipeline incorporates an asynchronous reward mechanism to decouple scoring from training, effectively hiding network latency. The group-relative advantage is calculated via weighted summation with per-prompt-group normalization: A(x0(i),c)=k=1KwkRk(x0(i),c)μkσkA(\boldsymbol{x}_0^{(i)}, c) = \sum_{k=1}^K w_k \cdot \frac{R_k(\boldsymbol{x}_0^{(i)}, c) - \mu_k}{\sigma_k}A(x0(i),c)=k=1KwkσkRk(x0(i),c)μk This ensures the composite reward is invariant to absolute scale differences across reward models. Task-specific optimizations include restricting training signals to high-noise timesteps to prevent reward hacking and curating prompts that exhibit sufficient reward variance to ensure meaningful policy improvement.

On-Policy Distillation To address the limitation that RL optimization produces task-specialized policies (e.g., separate models for Text-to-Image generation and image editing), the authors introduce On-Policy Distillation (OPD). This method unifies multiple task-specialized teacher models into a single student model via trajectory-level velocity matching. The student model generates a full denoising trajectory and is trained to match the velocity predictions of the task-appropriate teacher at each step: LOPD=Ec,x[1:N]πθ(c)[n=1Nvθ(xtn,tn,c)vθ(xtn,tn,c)2]\mathcal{L}_{\mathrm{OPD}} = \mathbb{E}_{c, \boldsymbol{x}_{[1:N]} \sim \pi_{\theta}(\cdot|c)} \left[ \sum_{n=1}^N \left\| \boldsymbol{v}_{\theta}(\boldsymbol{x}_{t_n}, t_n, c) - \boldsymbol{v}_{\theta^*}(\boldsymbol{x}_{t_n}, t_n, c) \right\|^2 \right]LOPD=Ec,x[1:N]πθ(c)[n=1Nvθ(xtn,tn,c)vθ(xtn,tn,c)2] This decomposed strategy avoids the optimization conflicts inherent in jointly training on mixed tasks (Mix-RL). The resulting unified model achieves superior prompt adherence, aesthetic quality, and identity preservation compared to both the pre-trained base model and the Mix-RL baseline.

For image editing tasks, the framework integrates a dedicated model-based face identity consistency scorer alongside VLM-based instruction-following rewards. This ensures that fine-grained identity features are preserved during complex edits, such as style transfer or attribute modification, allowing the model to accurately execute instructions while maintaining the subject's original characteristics.

Experiment

The evaluation spans automated quality metrics and human preference arenas. RL training consistently improves Qwen-Image-2.0 across all dimensions, with the largest boosts in creative generation, real-world fidelity, 3D modeling, and photorealism. The image editing arena also shows substantial gains, confirming the effectiveness of the editing-specific RL training.

Reinforcement learning training lifts Qwen-Image-2.0's overall score on Qwen-Image-Bench from 55.23 to 57.84, with consistent gains across all five evaluation pillars. The largest improvements appear in Creative Generation and Real-world Fidelity, making the RL-trained model the strongest performer among all compared systems. Creative Generation benefits the most from RL, gaining 6.72 points and highlighting enhanced imaginative capabilities. Real-world Fidelity rises by 4.29 points, reflecting better structural consistency and fine detail rendering.

Reinforcement learning training improves Qwen-Image-2.0's overall performance on Qwen-Image-Bench, yielding consistent gains across all evaluation pillars. The largest improvements occur in Creative Generation and Real-world Fidelity, making the RL-trained model the strongest performer among compared systems. These gains reflect enhanced imaginative capabilities, better structural consistency, and finer detail rendering.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています