MiniCPM-V: A GPT-4V Level MLLM on Your Phone

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scope in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong OCR capability and 1.8M-pixel high-resolution image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V) level performance are rapidly decreasing, while end-side computation capacity is growing fast. Together, these trends show that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.