Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs that facilitate basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance the visual tokens, we propose utilizing an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B parameters. It is demonstrated to achieve leading performance on several zero-shot benchmarks and even surpasses private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
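To make the "high-resolution refinement without increasing the visual token count" idea concrete, below is a minimal, hypothetical sketch: low-resolution visual tokens act as queries that mine detail from a denser high-resolution feature map via cross-attention, so the number of visual tokens passed to the LLM stays fixed. All module names, dimensions, and token counts here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of high-resolution refinement via cross-attention.
# Queries come from the low-resolution encoder; keys/values come from an
# additional high-resolution encoder. The output keeps the query token count.
import torch
import torch.nn as nn


class HighResRefinement(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lr_tokens: torch.Tensor, hr_tokens: torch.Tensor) -> torch.Tensor:
        # lr_tokens: (B, N, D) -- tokens from a low-resolution encoder
        # hr_tokens: (B, M, D) -- denser features from a high-resolution encoder, M >> N
        q = self.norm_q(lr_tokens)
        kv = self.norm_kv(hr_tokens)
        refined, _ = self.cross_attn(q, kv, kv)
        # Residual connection preserves the original token count N.
        return lr_tokens + refined


# Usage (illustrative sizes): 576 low-res tokens are enriched by 2304
# high-res features, yet the LLM still receives only 576 visual tokens.
miner = HighResRefinement(dim=1024)
lr = torch.randn(1, 576, 1024)
hr = torch.randn(1, 2304, 1024)
out = miner(lr, hr)
assert out.shape == lr.shape  # token count unchanged
```

The key design point this sketch illustrates is that resolution is added on the key/value side of the attention, so the compute fed to the LLM scales with the low-resolution token count rather than with the raw image resolution.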