MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters, using just 46.7% of the GPU memory cost and 8.7% of the inference time of Qwen2.5-VL 7B.