
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie
Published: May 18, 2025
Abstract

This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.
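To make the "deep fusion" idea concrete, below is a minimal, illustrative PyTorch sketch of one fused transformer block, assuming a common design in this family: text tokens pass through a frozen LLM-side branch, image tokens through a trainable DiT-side branch, and the two streams interact via joint self-attention over the concatenated sequence. All module names, dimensions, and the frozen-text-branch choice are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch (assumed design, not the paper's recipe) of one deep-fusion
# block: text tokens use LLM-side (frozen) weights, image tokens use DiT-side
# (trainable) weights, and both streams mix through a joint self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        # Separate per-modality projections and MLPs.
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Freeze the text (LLM-side) branch -- one common deep-fusion choice.
        for module in (self.txt_qkv, self.txt_out, self.txt_mlp):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        b, t, d = txt.shape
        # Per-modality QKV, then joint attention over the concatenated sequence.
        qkv = torch.cat([self.txt_qkv(txt), self.img_qkv(img)], dim=1)  # (b, t+i, 3d)
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            # (b, n, d) -> (b, heads, n, d // heads)
            return x.view(b, -1, self.n_heads, d // self.n_heads).transpose(1, 2)

        attn = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        attn = attn.transpose(1, 2).reshape(b, -1, d)
        # Split the fused sequence back into modalities, add residuals and MLPs.
        txt = txt + self.txt_out(attn[:, :t])
        img = img + self.img_out(attn[:, t:])
        return txt + self.txt_mlp(txt), img + self.img_mlp(img)


# Toy usage: text hidden states from the prompt, noised image patch latents.
block = FusedBlock()
txt_tokens = torch.randn(2, 16, 512)
img_tokens = torch.randn(2, 64, 512)
txt_out, img_out = block(txt_tokens, img_tokens)
print(txt_out.shape, img_out.shape)  # torch.Size([2, 16, 512]) torch.Size([2, 64, 512])
```

In this sketch, only the image-side projections and MLPs receive gradients, which mirrors the general setup the abstract refers to (reusing a pretrained LLM while training the diffusion components); the actual layer structure, conditioning, and training recipe are what the paper studies empirically.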