Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).