OmniGen2: Open-Source Model Achieves State-of-the-Art Results in Multimodal Generation Tasks
In this work, we introduce OmniGen2, an advanced open-source generative model designed to handle a wide range of generation tasks, including text-to-image generation, image editing, and in-context generation. Unlike its predecessor OmniGen, OmniGen2 uses two distinct decoding pathways for text and for images, each with its own parameters and an independent image tokenizer. This design lets OmniGen2 build on existing multimodal understanding models without re-adapting their inputs for the variational autoencoder (VAE), thereby preserving robust text generation capabilities.

To train OmniGen2, we developed comprehensive data construction pipelines, including datasets tailored for image editing and in-context generation. We also designed a reflection mechanism dedicated to image generation and curated a corresponding reflection dataset. Despite its relatively modest parameter count, OmniGen2 performs strongly across a variety of generation tasks, particularly text-to-image generation and image editing.

For in-context generation, often referred to as subject-driven generation, we introduce a new benchmark named OmniContext, on which OmniGen2 achieves state-of-the-art consistency among open-source models. To support future research, we will release our models, training code, datasets, and data pipelines. For more details, visit the project page at https://vectorspacelab.github.io/OmniGen2 or the GitHub repository at https://github.com/VectorSpaceLab/OmniGen2.
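To make the decoupled design concrete, here is a minimal, hypothetical sketch of the idea described above: a shared multimodal backbone feeds two separate decoding heads, one producing text-token logits and one producing image latents, each with its own parameters. The class name `DualPathDecoder`, the dimensions, and the plain MLP standing in for the image decoder are illustrative assumptions, not the actual OmniGen2 implementation.

```python
# Hypothetical sketch (not the actual OmniGen2 code): two decoding paths that
# share a conditioning backbone but keep separate parameters, so the text path
# is left untouched when the image path is trained.
import torch
import torch.nn as nn


class DualPathDecoder(nn.Module):
    def __init__(self, hidden_dim=1024, vocab_size=32000, latent_channels=4):
        super().__init__()
        # Shared backbone producing hidden states (stand-in for a multimodal LLM).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Text path: its own head, predicting next-token logits.
        self.text_head = nn.Linear(hidden_dim, vocab_size)
        # Image path: its own decoder parameters, predicting VAE-latent outputs
        # (a diffusion transformer in practice; a plain MLP here for brevity).
        self.image_decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_channels),
        )

    def forward(self, embeddings, mode="text"):
        h = self.backbone(embeddings)
        if mode == "text":
            return self.text_head(h)       # (B, T, vocab_size) logits
        return self.image_decoder(h)       # (B, T, latent_channels) latents


# Usage: the same conditioning sequence can drive either decoding path.
model = DualPathDecoder()
cond = torch.randn(2, 16, 1024)            # (batch, tokens, hidden_dim)
text_logits = model(cond, mode="text")
image_latents = model(cond, mode="image")
print(text_logits.shape, image_latents.shape)
```

The point of keeping the two heads parameter-disjoint, as in this sketch, is that adding or fine-tuning the image pathway does not perturb the weights responsible for text generation.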