InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng Harry Lee Manyuan Zhang Kaituo Feng Zoey Guo Ray Zhang Hongsheng Li

Abstract

Recent image generators have demonstrated impressive photorealistic quality and instruction-following ability in single-image generation and editing. Because of their architectures, however, they cannot perform interleaved generation (producing text-image sequences), which is crucial for visual storytelling, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) show only limited capability in this respect. In this work, we introduce InterleaveThinker, the first multi-agent pipeline designed to equip any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence and to instruct the image generator on what to execute at each step. We then introduce a critic agent that evaluates the generator's outputs, identifies deviations from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, we build Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold start. We then build Interleave-Critic-RL-13k to strengthen step-wise instruction correction within a generation trajectory via GRPO. Because a single interleaved generation trajectory can involve more than 25 generator calls, optimizing the entire trajectory is computationally impractical. We therefore propose an accuracy reward and a step-wise reward that allow single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves the performance of various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also substantially improves the base model's performance on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein we observe clear gains on WISE and RISE.

One-sentence Summary

InterleaveThinker is a multi-agent pipeline that equips existing image generators with interleaved text-image generation: a planner agent structures step-wise instructions, a critic agent evaluates outputs and refines prompts for regeneration, and GRPO reinforces the critic's step-wise instruction correction, addressing the architectural constraints that limit prior generators and unified multimodal models on visual narratives and embodied manipulation.

Key Contributions

  • The paper introduces InterleaveThinker, a multi-agent pipeline that retrofits frozen image generators with interleaved text-image sequence generation without modifying their base architectures. A planner agent structures the execution steps while a critic agent evaluates outputs, identifies deviations, and refines prompts to ensure strict trajectory adherence.
  • Training relies on three curated datasets: Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for the format cold start, and Interleave-Critic-RL-13k for reinforcement learning. GRPO-based optimization with a dual-reward strategy (an accuracy reward and a step-wise reward) lets single-step RL steer long-horizon generation trajectories without the cost of optimizing each full trajectory.
  • Evaluated on off-the-shelf generators such as FLUX.2-klein, the framework surpasses open-source unified multimodal models on interleaved generation benchmarks and matches proprietary systems like Nano Banana and GPT-5. The approach also substantially improves reasoning performance on the WISE benchmark (0.47 to 0.73) and the RISE benchmark (13.3 to 28.9).
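
The summary does not spell out how the two rewards are combined, so the following is only a minimal sketch of the group-relative advantage that GRPO is generally known for: several critic responses are sampled for the same step, each is scored, and every score is normalized against the group's mean and standard deviation. The function names, the weighted-sum combination in `combined_reward`, and its weights are assumptions for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy_reward: float, stepwise_reward: float,
                    w_acc: float = 1.0, w_step: float = 1.0) -> float:
    """Hypothetical weighted sum of the two reward terms (assumed, not from the paper)."""
    return w_acc * accuracy_reward + w_step * stepwise_reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same step.

    Uses the population standard deviation; eps keeps the division defined
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled critic responses for one refinement step (toy numbers).
    group = [combined_reward(1.0, 0.5), combined_reward(0.0, 0.2),
             combined_reward(1.0, 0.9), combined_reward(0.0, 0.0)]
    print(group_relative_advantages(group))
```

Because each advantage depends only on samples for a single step, this kind of update never needs a full 25-call trajectory to be rolled out, which is consistent with the single-step RL motivation stated above.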

Introduction

Modern image generation models excel at single-image synthesis, yet practical applications like visual storytelling and embodied manipulation demand interleaved generation that seamlessly alternates text and image outputs. While Unified Multimodal Models attempt to support this workflow, they frequently exhibit visual over-reliance on intermediate states and suffer from compounding step-by-step errors during extended sequences. To address these limitations, the authors propose InterleaveThinker, a multi-agent framework that retrofits frozen image generators with robust sequential capabilities. The system utilizes a Planner agent to forecast complete instruction trajectories upfront, effectively bypassing premature visual dependency, while a Critic agent evaluates outputs and refines prompts to prevent error accumulation. By combining this architecture with a curated training dataset and a dual-reward reinforcement learning strategy, the authors achieve trajectory-level alignment that matches proprietary models and significantly enhances base model reasoning.
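
The control flow described here can be sketched as a loop in which the planner produces the full list of instructions first and the critic gates each generated image. This is an illustrative sketch only: `plan`, `generate`, `critique`, `Verdict`, and `max_retries` are hypothetical names, images are stood in for by strings, and the real agents are models rather than plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    accepted: bool              # did the image follow the planned instruction?
    refined_instruction: str    # rewritten prompt to use if it did not


def run_interleaved(
    user_request: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Verdict],
    max_retries: int = 2,
) -> List[Tuple[str, str]]:
    """Return (planned instruction, final image) pairs for every planned step."""
    outputs = []
    # The planner forecasts the whole instruction trajectory before any image exists.
    for instruction in plan(user_request):
        prompt = instruction
        image = generate(prompt)
        for _ in range(max_retries):
            # The critic always judges against the planned instruction,
            # not against the most recently refined prompt.
            verdict = critique(instruction, image)
            if verdict.accepted:
                break
            prompt = verdict.refined_instruction
            image = generate(prompt)
        outputs.append((instruction, image))
    return outputs
```

The generator is only ever called through `generate`, which is what allows a frozen, off-the-shelf model to be swapped in without touching its weights.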

Dataset

  • Dataset Composition and Sources: The authors generate roughly 40,000 text prompts through a top-down pipeline that starts with 8 broad domains, expands to 75 fine-grained subcategories, and leverages Gemini 2.5 Pro to build domain-specific vocabulary banks and instructional templates. Multi-agent trajectory generation combines Gemini 2.5 Pro and Nano Banana Pro, with FLUX.2-klein-9B added to balance visual quality and prevent critic bias. The final corpus also integrates existing open-source interleaved datasets to supplement planner training.
  • Subset Details and Filtering: Interleave-Critic-SFT-112k contains 112,000 samples filtered for successful refinement trends, stable high scores, and low iteration score variance. Interleave-Critic-RL-13k holds 13,000 samples selected for high score variance to capture dynamic refinement processes, maintaining a strict 2:1 ratio with the SFT subset. Interleave-Planner-SFT-80k comprises 80,000 samples that bypass critic filtering entirely, preserving the original unfiltered trajectories for planner training.
  • Training Splits and Processing: The pipeline decomposes full trajectories into independent step-wise segments to enable stable single-iteration optimization instead of computationally prohibitive end-to-end reinforcement learning. Each refinement step is scored from 0 to 10 for semantic alignment and visual quality by Gemini 2.5 Pro, following a scoring protocol adapted from VIEScore. The authors apply targeted resampling to balance the binary judgment distribution for the critic, ensuring unbiased training across iteration-wise predictions.
  • Metadata and Structural Processing: Planner training pairs are constructed by randomly truncating interleaved text-image sequences, where the preceding context serves as input and the subsequent text plan acts as the target output. Metadata explicitly tracks original user instructions, rewritten refinement prompts, and paired original versus generated images to support step-wise evaluation. The filtering pipeline discards steps exhibiting negative refinement trends or persistent low quality, retaining only those that demonstrate successful iterative improvement.
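
Two of the processing steps above can be sketched in a few lines: routing a refinement step to the SFT or RL subset based on its per-iteration scores, and building a planner training pair by random truncation. The thresholds (`low_var`, `high_var`, `min_final`), the function names, and the choice to keep only the text items after the cut as the target are assumptions made for illustration; the summary does not give the actual values or data schema.

```python
import random
from statistics import pvariance
from typing import List, Optional, Tuple

Item = Tuple[str, str]  # ("text" | "image", payload)


def route_step(scores: List[float], low_var: float = 1.0,
               high_var: float = 4.0, min_final: float = 7.0) -> Optional[str]:
    """Assign one refinement step (its 0-10 scores per iteration) to a subset."""
    if scores[-1] < scores[0]:
        return None                      # negative refinement trend: discard
    var = pvariance(scores)
    if var >= high_var:
        return "rl"                      # dynamic refinement process
    if var <= low_var and scores[-1] >= min_final:
        return "sft"                     # stable, high-scoring step
    return None                          # e.g. persistently low quality


def make_planner_pair(sequence: List[Item],
                      rng: random.Random) -> Tuple[List[Item], List[str]]:
    """Cut an interleaved sequence at a random point.

    Everything before the cut is the planner input; the text items after
    the cut form the target plan (an assumed reading of "subsequent text plan").
    """
    cut = rng.randrange(1, len(sequence))
    context = sequence[:cut]
    target = [payload for kind, payload in sequence[cut:] if kind == "text"]
    return context, target


if __name__ == "__main__":
    print(route_step([8, 8, 9]), route_step([2, 6, 9]), route_step([9, 5, 3]))
    seq = [("text", "t1"), ("image", "i1"), ("text", "t2"), ("image", "i2")]
    print(make_planner_pair(seq, random.Random(0)))
```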

Experiment

The evaluation employs a multi-agent InterleaveThinker framework to validate performance across interleaved generation and reasoning-based editing benchmarks using both in-domain and generalization image models. Results demonstrate that the approach significantly outperforms existing open-source methods by effectively mitigating visual over-reliance and step-wise error accumulation while preserving textual fidelity and image quality. Ablation studies confirm that the dedicated planner-critic architecture, fine-tuned training stages, and closed-loop refinement process are essential for robust performance, as single-model or unfiltered alternatives consistently degrade results. Although the framework encounters limitations with out-of-domain concepts unknown to the base generator, it remains a highly generalizable and model-agnostic solution for complex multimodal tasks.

