MiniCPM-V: A GPT-4V Level MLLM on Your Phone

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scope in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V) level performance are rapidly decreasing, while end-side computation capacity is growing fast. Together, these trends show that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.