
Ovis-U1 Technical Report

Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Abstract

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.