UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, they remain limited in image perception and manipulation tasks, which are in high demand among users for a wide range of applications. Recently, OpenAI released its powerful GPT-4o-Image model for comprehensive image perception and manipulation, demonstrating strong expressive capability and attracting considerable community interest. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders rather than by a VAE, even though VAEs are considered essential components in many image manipulation models. Motivated by these observations, we present UniWorld, a unified generative framework based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% of the amount of data used by BAGEL, and it consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.