MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters, using just 46.7% of the GPU memory cost and 8.7% of the inference time of Qwen2.5-VL 7B.