Command Palette
Search for a command to run...
UniReason 1.0: 세계 지식을 일치시키는 이미지 생성 및 편집을 위한 통합 추론 프레임워크
UniReason 1.0: 세계 지식을 일치시키는 이미지 생성 및 편집을 위한 통합 추론 프레임워크
초록
통합 다중모달 모델은 깊이 있는 추론을 필요로 하는 복잡한 합성 작업에서 종종 어려움을 겪하며, 텍스트에서 이미지 생성과 이미지 편집을 서로 분리된 기능으로 취급하는 경우가 많다. 이를 해결하기 위해 우리는 이 두 작업을 이중 추론 파라다임을 통해 조화롭게 통합하는 유니리슨(UniReason)이라는 통합 프레임워크를 제안한다. 생성 과정을 세계 지식을 강화한 계획(planning)으로 설정하여 암묵적인 제약 조건을 도입하고, 이미지 편집 능력을 활용해 세밀한 시각적 보정을 수행함으로써 자기 반성(self-reflection)을 통해 시각적 오류를 추가로 수정한다. 이러한 접근은 생성과 편집을 공유된 표현 공간 내에서 통합함으로써 인간의 인지 과정인 ‘계획 후 보정’을 모방한다. 본 프레임워크를 뒷받침하기 위해, 계획에 활용할 수 있는 다섯 가지 주요 지식 영역(예: 문화적 공감, 물리학 등)을 포함하는 대규모 추론 중심 데이터셋(~30만 개 샘플)을 체계적으로 구축하였으며, 시각적 자기 보정을 위한 에이전트 생성 코퍼스도 함께 준비하였다. 광범위한 실험을 통해 UniReason가 WISE, KrisBench, UniREditBench와 같은 추론 중심 벤치마크에서 뛰어난 성능을 달성함과 동시에, 뛰어난 일반화 합성 능력을 유지함을 입증하였다.
One-sentence Summary
Fudan University and Shanghai Innovation Institute researchers propose UniReason, a unified framework merging text-to-image generation and editing via dual reasoning—planning with world knowledge and self-reflective refinement—outperforming prior models on reasoning benchmarks while enabling coherent, error-corrected visual synthesis across diverse domains.
Key Contributions
- UniReason introduces a unified framework that treats text-to-image generation and image editing as interconnected reasoning steps, leveraging shared representations to mirror human planning and refinement, thereby closing the gap between abstract intent and faithful visual output.
- It combines world knowledge-enhanced textual reasoning to inject implicit constraints before generation and fine-grained visual refinement via self-reflection after generation, enabling iterative correction and mutual capability transfer between synthesis and editing tasks.
- The framework is supported by a large-scale reasoning dataset (~300k samples) spanning five knowledge domains and an agent-generated corpus for self-correction, achieving state-of-the-art results on benchmarks including WISE, KrisBench, and UniREditBench.
Introduction
The authors leverage a unified framework called UniReason to bridge the gap between text-to-image generation and image editing by treating them as interconnected reasoning steps rather than isolated tasks. Prior methods either reason only before generation without visual feedback or treat generation and editing separately, leaving implicit world knowledge—like physics or cultural commonsense—untapped and failing to exploit their structural synergy. UniReason introduces two key components: world knowledge-enhanced textual reasoning to infer implicit constraints before synthesis, and fine-grained editing-like visual refinement to self-correct errors afterward, all trained on a newly constructed dataset spanning five knowledge domains and an agent-generated refinement corpus. This unified approach achieves state-of-the-art results on reasoning-heavy benchmarks while maintaining strong general synthesis performance.
Dataset
The authors use a two-phase dataset construction pipeline to train UniReason for world knowledge-enhanced multimodal reasoning in image synthesis. Here’s how the data is composed, processed, and used:
-
Dataset Composition and Sources
- Built for both text-to-image (T2I) generation and image editing.
- Draws from Wikipedia for seed prompts, UniREdit-Data-100K for editing triples, and ShareGPT-4o-Image and Midjourney prompts for refinement training.
- Five world knowledge categories are covered: Cultural Commonsense, Natural Science, Spatial Reasoning, Temporal Reasoning, and Logical Reasoning.
-
Key Details per Subset
- T2I Generation: Seed prompts manually crafted from Wikipedia; expanded and paired with CoT reasoning using Gemini-2.5 Pro. Images rendered via Qwen-Image.
- Image Editing: Uses UniREdit-Data-100K triples; augmented with Gemini-2.5 Pro-generated reasoning traces.
- Refinement Data: Generated via an agent pipeline — initial image + reasoning → Gemini-2.5 Pro feedback → Qwen-Image-Edit refinement → Gemini-2.5 Pro judge for improvement validation.
-
Data Filtering and Quality Control
- All samples are evaluated by Gemini-2.5 Pro across three dimensions: instruction alignment, visual fidelity, and reasoning correctness.
- Only verified, hallucination-free samples are retained.
- Refinement phase includes comparative evaluation to ensure measurable visual and semantic improvements.
-
Training Usage and Processing
- Training data mixes T2I and editing samples, each paired with structured reasoning traces.
- Mixture ratios are not explicitly specified but emphasize balanced coverage across the five knowledge categories.
- No cropping or metadata construction is mentioned; focus is on reasoning trace generation and iterative refinement supervision.
Method
The authors leverage a unified multimodal reasoning framework, termed UniReason, designed to support both text-to-image (T2I) generation and image editing through interleaved textual reasoning and visual refinement. The architecture builds upon Bagel’s Mixture-of-Transformers (MoT) foundation, integrating a ViT encoder to process multimodal inputs and enabling joint understanding and generation within a single model. Multimodal understanding is formalized as next-token prediction conditioned on context, minimizing the negative log-likelihood:
Ltext=−t=1∑Tlogpθ(xt∣x<t,C),where xt is the target token, x<t the preceding tokens, and C the multimodal context. Image generation operates in a VAE latent space via rectified flow, minimizing the latent flow-matching loss:
Limage=Et∼U(0,1)∥uθ(zt,t;C)−u⋆(zt,t)∥γ2,with u∗ as the target velocity and uθ the learned time-conditioned velocity field.
The framework operates in two sequential phases. Phase 1, World Knowledge-Enhanced Textual Reasoning, initiates synthesis by generating a reasoning trace that incorporates domain-specific knowledge—such as style, object attributes, or spatial constraints—before producing an initial draft image. This phase is supported by a data pipeline that expands seed prompts using world knowledge sources (e.g., Commonsense, Natural Science) and generates reasoning chains via Gemini 2.5 Pro, followed by image rendering via Qwen-Image and verification.
Phase 2, Fine-grained Editing-like Visual Refinement, iteratively improves the draft by reflecting on its shortcomings. The model generates textual reflections identifying inconsistencies or missing details—such as object presence, attribute accuracy, or aesthetic quality—and optionally performs a second round of reasoning to guide visual refinement. This process mirrors image editing, enabling a synergistic loop where T2I generation benefits from editing expertise and vice versa. The refinement data is constructed using an agent pipeline: an initial generator produces a draft, a verifier (Gemini 2.5 Pro) outputs structured edit directives, a refinement teacher (Qwen-Image-Edit) applies them, and a final judge retains only improved outputs.
Training proceeds via a two-stage supervised fine-tuning strategy. Stage 1 freezes the understanding branch and trains only the generation branch on standard T2I and editing datasets to strengthen foundational synthesis. Stage 2 unfreezes all parameters and jointly trains both branches on interleaved reasoning data, supervising reasoning traces and refined images while leaving initial drafts unsupervised. The overall loss is a weighted sum:
L=λtextLtext+λimgLimg,where Ltext and Limg are the text and image losses, and λtext, λimg are scalar weights balancing their contributions.
Experiment
- Two-stage training successfully integrates knowledge-enhanced reasoning and visual refinement into a unified multimodal model, significantly boosting performance on knowledge-intensive tasks.
- The model outperforms leading open-source unified models and matches or exceeds closed-source counterparts like GPT-4o and Gemini 2.0 on key benchmarks, particularly in cultural, spatial, and scientific reasoning.
- It maintains strong general generation and editing capabilities, surpassing systems like Qwen-Image and Seedream 4.0 on general benchmarks without relying on external LLMs.
- Ablation studies confirm that each component—two-stage training, textual reasoning, and visual refinement—contributes progressively to improved outcomes, with reasoning yielding the largest gains.
- Image editing proficiency directly correlates with refinement effectiveness, underscoring the need for joint training of generation and editing within a unified reasoning framework.
- Qualitative results show the model handles complex reasoning scenarios and refines fine details like faces, text, and gestures, enhancing both accuracy and visual fidelity.
The authors use a two-stage training approach to enhance a unified multimodal model’s ability to perform both image generation and editing while incorporating explicit reasoning. Results show the model outperforms other open-source unified models across knowledge-intensive tasks and matches or exceeds closed-source systems in specific benchmarks, particularly in cultural, spatial, and scientific reasoning. It also maintains strong performance on general generation and editing tasks, demonstrating broad capability without sacrificing versatility.

The authors use a progressive ablation study to show that adding two-stage training, reasoning, and refinement components sequentially improves performance across knowledge-intensive benchmarks. Results show that each stage contributes meaningfully, with reasoning providing the largest gain on WISE and refinement further boosting scores across all tasks. This indicates that integrating explicit reasoning and visual refinement into a unified framework enhances both knowledge handling and output quality.

The authors evaluate their model on general image editing benchmarks, showing it achieves the highest overall score among unified models with reasoning capabilities. Results indicate strong performance across fine-grained editing categories like Add, Adjust, and Replace, while also maintaining competitive scores on GEdit-EN. This demonstrates the model’s ability to balance general editing proficiency with reasoning-enhanced control.

The authors use a unified multimodal model with explicit reasoning and refinement mechanisms to achieve top performance among open-source systems on knowledge-intensive image generation and editing tasks. Results show the model outperforms comparable unified models and rivals closed-source systems like GPT-4o and Gemini 2.0 on specific benchmarks, particularly in factual and conceptual reasoning. The model also maintains strong general generation and editing capabilities, demonstrating broad applicability without sacrificing versatility.

The authors use a two-stage training approach to enhance a unified multimodal model’s ability to perform both text-to-image generation and image editing while incorporating reasoning. Results show the model achieves top performance among open-source systems on general benchmarks and outperforms several closed-source models in specific knowledge-intensive tasks. It also maintains strong generalization across both generation and editing, indicating that reasoning integration does not compromise broad capability.
