
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models of a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
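To give a rough intuition for the idea of language-guided vision-language integration described above, the sketch below shows a minimal single-head cross-attention step in plain NumPy, where queries come from the language stream and keys/values come from visual tokens. This is an illustrative assumption, not the paper's hyper attention block: the function and weight names (`cross_attention`, `Wq`, `Wk`, `Wv`) are hypothetical, and the real architecture involves additional components such as gating and multi-head projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vision_h, Wq, Wk, Wv):
    # Queries are derived from the language hidden states, so the
    # attention is "language-guided"; keys/values come from visual
    # tokens, projecting them into the language semantic space.
    q = text_h @ Wq
    k = vision_h @ Wk
    v = vision_h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))     # 5 language tokens
vision = rng.standard_normal((12, d))  # 12 visual tokens (e.g. one image)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_attention(text, vision, Wq, Wk, Wv)
print(out.shape)  # one fused vector per language token: (5, 8)
```

Because the output keeps the language sequence length regardless of how many visual tokens are attended over, a block of this general shape can absorb long multi-image inputs without growing the language sequence itself.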
