HyperAIHyperAI

Command Palette

Search for a command to run...

InterleaveThinker: エージェント型交互生成の強化

Dian Zheng Harry Lee Manyuan Zhang Kaituo Feng Zoey Guo Ray Zhang Hongsheng Li

概要

最近の画像生成モデルは、単一画像の生成および編集において、印象的な写実性と指示追従能力を実証している。しかし、アーキテクチャの制約により、視覚的ナラティブ、ガイダンス、および身体化操作において重要な応用を持つインターリーブ生成(テキスト-画像シーケンス)を実現することができない。最新のオープンソース統一マルチモーダルモデル(UMMs)でさえ、この点において限定的な性能しか発揮していない。本論文では、既存の任意の画像生成モデルにインターリーブ生成能力を付与するために設計された初のmulti-agent pipelineであるInterleaveThinkerを紹介する。具体的には、planner agentを用いて画像-テキスト入力シーケンスを整理し、各ステップで必要な処理を画像生成モデルに指示する。続いて、生成モデルの出力を評価し、計画された指示から逸脱したサンプルを特定し、再生成のための指示を精緻化するcritic agentを導入する。このパイプラインを実装するため、フォーマットのコールドスタートを実行するInterleave-Planner-SFT-80kおよびInterleave-Critic-SFT-112kを構築する。次に、GRPOを用いて生成軌跡内におけるステップごとの指示修正能力を強化するInterleave-Critic-RL-13kを開発する。単一のインターリーブ生成軌跡には25回以上の生成モデル呼び出しが含まれる可能性があるため、軌跡全体を最適化することは計算コストの観点から非現実的である。したがって、単一ステップの強化学習が生成軌跡全体を効果的に誘導できるように、精度報酬およびステップごとの報酬を提案する。実験結果は、InterleaveThinkerが様々な画像生成モデルにおいて性能を向上させることを示している。インターリーブ生成ベンチマークにおいて、Nano BananaおよびGPT-5と同等の性能を達成している。驚くべきことに、推論ベースのベンチマークにおいてもベースモデルの性能を大幅に向上させており、例えば4-step FLUX.2-kleinにおいて、WISEおよびRISEで顕著な向上が観測された。

One-sentence Summary

InterleaveThinker is a multi-agent pipeline that equips existing image generators with interleaved text-image sequence generation by coordinating a planner agent to structure stepwise instructions and a critic agent to evaluate outputs and refine subsequent prompts, with stepwise instruction correction within generation trajectories reinforced through GRPO to address the architectural constraints of prior unified multimodal models in visual narratives and embodied manipulation.

Key Contributions

  • The paper introduces InterleaveThinker, a multi-agent pipeline that retrofits frozen image generators with interleaved text-image sequence generation without modifying their base architectures. A planner agent structures the execution steps while a critic agent evaluates outputs, identifies deviations, and refines prompts to ensure strict trajectory adherence.
  • Training is enabled through three curated datasets, Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for format cold-starting, and Interleave-Critic-RL-13k for reinforcement learning. A GRPO-based optimization with a dual-reward strategy comprising accuracy and step-wise rewards efficiently aligns long-horizon generation trajectories at reduced computational cost.
  • Evaluated on off-the-shelf generators such as FLUX.2-klein, the framework surpasses open-source unified multimodal models on interleaved generation benchmarks and matches proprietary systems like Nano Banana and GPT-5. The approach also substantially improves reasoning performance on the WISE benchmark (0.47 to 0.73) and the RISE benchmark (13.3 to 28.9).

Introduction

Modern image generation models excel at single-image synthesis, yet practical applications like visual storytelling and embodied manipulation demand interleaved generation that seamlessly alternates text and image outputs. While Unified Multimodal Models attempt to support this workflow, they frequently exhibit visual over-reliance on intermediate states and suffer from compounding step-by-step errors during extended sequences. To address these limitations, the authors propose InterleaveThinker, a multi-agent framework that retrofits frozen image generators with robust sequential capabilities. The system utilizes a Planner agent to forecast complete instruction trajectories upfront, effectively bypassing premature visual dependency, while a Critic agent evaluates outputs and refines prompts to prevent error accumulation. By combining this architecture with a curated training dataset and a dual-reward reinforcement learning strategy, the authors achieve trajectory-level alignment that matches proprietary models and significantly enhances base model reasoning.

Dataset

  • Dataset Composition and Sources: The authors generate roughly 40,000 text prompts through a top-down pipeline that starts with 8 broad domains, expands to 75 fine-grained subcategories, and leverages Gemini 2.5 Pro to build domain-specific vocabulary banks and instructional templates. Multi-agent trajectory generation combines Gemini 2.5 Pro and Nano Banana Pro, with FLUX.2-klein-9B added to balance visual quality and prevent critic bias. The final corpus also integrates existing open-source interleaved datasets to supplement planner training.
  • Subset Details and Filtering: Interleave-Critic-SFT-112k contains 112,000 samples filtered for successful refinement trends, stable high scores, and low iteration score variance. Interleave-Critic-RL-13k holds 13,000 samples selected for high score variance to capture dynamic refinement processes, maintaining a strict 2:1 ratio with the SFT subset. Interleave-Planner-SFT-80k comprises 80,000 samples that bypass critic filtering entirely, preserving the original unfiltered trajectories for planner training.
  • Training Splits and Processing: The pipeline decomposes full trajectories into independent step-wise segments to enable stable single-iteration optimization instead of computationally prohibitive end-to-end reinforcement learning. Each refinement step is scored from 0 to 10 for semantic alignment and visual quality using Gemini 2.5 Pro adapted from VIEScore. The authors apply targeted resampling to balance the binary judgment distribution for the critic, ensuring unbiased training across iteration-wise predictions.
  • Metadata and Structural Processing: Planner training pairs are constructed by randomly truncating interleaved text-image sequences, where the preceding context serves as input and the subsequent text plan acts as the target output. Metadata explicitly tracks original user instructions, rewritten refinement prompts, and paired original versus generated images to support step-wise evaluation. The filtering pipeline discards steps exhibiting negative refinement trends or persistent low quality, retaining only those that demonstrate successful iterative improvement.

Experiment

The evaluation employs a multi-agent InterleaveThinker framework to validate performance across interleaved generation and reasoning-based editing benchmarks using both in-domain and generalization image models. Results demonstrate that the approach significantly outperforms existing open-source methods by effectively mitigating visual over-reliance and step-wise error accumulation while preserving textual fidelity and image quality. Ablation studies confirm that the dedicated planner-critic architecture, fine-tuned training stages, and closed-loop refinement process are essential for robust performance, as single-model or unfiltered alternatives consistently degrade results. Although the framework encounters limitations with out-of-domain concepts unknown to the base generator, it remains a highly generalizable and model-agnostic solution for complex multimodal tasks.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています