OmniGen2: Exploration to Advanced Multimodal Generation

In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
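To make the dual-pathway idea concrete, the minimal PyTorch sketch below shows one way such a design could be wired: an understanding backbone whose inputs are never re-adapted to VAE latents, a separate image decoder with unshared parameters conditioned on the backbone's hidden states, and a standalone latent (VAE-style) representation as the decoupled image tokenizer. All module names, dimensions, and the conditioning scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ToyDualPathwayModel(nn.Module):
    """Illustrative sketch (not OmniGen2's code): two decoding pathways with unshared parameters."""

    def __init__(self, vocab_size=32000, d_model=512, d_image=256, latent_ch=4):
        super().__init__()
        # Text pathway: the understanding transformer keeps its own embedding and LM head,
        # so its text-generation behavior is untouched by the image pathway.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.text_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

        # Image pathway: separate (unshared) decoder that cross-attends to the backbone's
        # hidden states and operates on VAE-style latents, which the backbone never sees.
        self.image_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_image, nhead=8, batch_first=True), num_layers=2
        )
        self.cond_proj = nn.Linear(d_model, d_image)      # bridge: understanding -> image decoder
        self.latent_proj_in = nn.Linear(latent_ch, d_image)
        self.latent_proj_out = nn.Linear(d_image, latent_ch)

    def forward(self, text_ids, noisy_latents):
        # Text decoding pathway: hidden states and next-token logits.
        h = self.text_backbone(self.token_embed(text_ids))
        text_logits = self.lm_head(h)

        # Image decoding pathway: predict denoised latents from noisy latents,
        # conditioned on the projected hidden states of the understanding backbone.
        z = self.latent_proj_in(noisy_latents)
        z = self.image_decoder(tgt=z, memory=self.cond_proj(h))
        pred_latents = self.latent_proj_out(z)
        return text_logits, pred_latents


if __name__ == "__main__":
    model = ToyDualPathwayModel()
    text_ids = torch.randint(0, 32000, (1, 16))   # dummy prompt tokens
    noisy_latents = torch.randn(1, 64, 4)          # dummy flattened image latents
    logits, latents = model(text_ids, noisy_latents)
    print(logits.shape, latents.shape)             # (1, 16, 32000) (1, 64, 4)
```

The point of the sketch is the decoupling: the text pathway's parameters and inputs are independent of the image latents, so an existing multimodal understanding model can be reused as the backbone without modification.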