UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan

Published: 6/4/2025

Abstract

Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, they remain limited in image perception and manipulation tasks, which users urgently need for a wide range of applications. Recently, OpenAI released its powerful GPT-4o-Image model for comprehensive image perception and manipulation, demonstrating expressive capability and attracting community interest. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders rather than by a VAE, even though VAEs are considered essential components in many image manipulation models. Motivated by these observations, we present UniWorld, a unified generative framework based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% of the amount of data used by BAGEL, and it consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.
