Vision Language Models: Advances and Innovations in Multimodal AI Over the Past Year
Vision Language Models (VLMs) have made significant strides over the past year, evolving from large, unwieldy systems into smaller, more efficient, and more capable ones. This summary highlights the key changes and developments in the field, showcasing advances in multimodal capabilities and offering a view of where VLMs are headed.

New Model Trends

Any-to-Any Models
These models accept and produce any combination of modalities, such as image, text, and audio. Chameleon by Meta pioneered the concept, though its image generation capabilities were never released; Alpha-VLLM's Lumina-mGPT built on Chameleon and added image generation. Qwen 2.5 Omni stands out as the most advanced any-to-any model, using a "Thinker-Talker" architecture in which the Thinker handles text generation and the Talker produces speech. MiniCPM-o 2.6 and DeepSeek AI's Janus-Pro-7B also excel at multimodal understanding and generation, illustrating the trend toward more comprehensive multimodal models.

Reasoning Models
Reasoning models, initially developed for large language models, have expanded to VLMs. QVQ-72B-preview by Alibaba Qwen was an early, explicitly experimental release, followed by Moonshot AI's Kimi-VL-A3B-Thinking. Kimi-VL uses MoonViT as its image encoder and a Mixture-of-Experts (MoE) decoder, and is optimized for long chain-of-thought reasoning and agentic use. It can process diverse inputs such as long videos, PDFs, and screenshots, which broadens its range of applications.

Smaller, Yet Capable Models
The community is increasingly focused on smaller, more efficient models. SmolVLM2, developed by Hugging Face, is a prime example: its 500M-parameter variant handles video understanding and runs on consumer devices, and the HuggingSnap iPhone app demonstrates how practical such models have become. Google DeepMind's gemma-3-4b-it, the smallest multimodal Gemma 3 variant, with a 128K-token context window and multilingual support, is another notable release, alongside Qwen2.5-VL-3B-Instruct, which handles tasks from object detection to document understanding with a context of up to 32K tokens.

Advanced Architectures

Mixture of Experts (MoEs) as Decoders
MoE architectures, which dynamically select and activate only the most relevant experts for each input, have shown potential for improving VLM performance and efficiency. MoE-based models such as Kimi-VL, MoE-LLaVA, and DeepSeek-VL2 reduce computational cost while maintaining strong performance. Unlike dense models, which activate every parameter for every token, MoEs route each token to a small subset of experts, leading to faster inference and better resource utilization.

Emerging Applications

Vision-Language-Action Models (VLAs)
VLMs are now being integrated into robotics as Vision-Language-Action (VLA) models. These models interpret images and text instructions to control robotic actions such as folding laundry, clearing tables, and bagging groceries. Examples include π0 and π0-FAST by Physical Intelligence and NVIDIA's GR00T N1, which execute complex tasks from vision and language inputs. They combine high-level reasoning with real-time movement control, marking a significant step for AI in physical environments.

Specialized Capabilities

Object Detection, Segmentation, and Counting
VLMs are increasingly capable of fine-grained computer vision tasks. PaliGemma was among the first to tackle them, emitting localization tokens for object detection and segmentation; a sketch of decoding these tokens into bounding boxes follows below.
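As a concrete illustration, here is a minimal sketch of decoding such localization tokens into pixel-space bounding boxes. It assumes the `<locXXXX>` output format used by the released PaliGemma checkpoints, where each detection is four tokens (y_min, x_min, y_max, x_max) on a 0-1023 grid followed by the object label; the exact format is model-specific, so verify it against the model card before relying on this parsing.

```python
import re

# Minimal sketch: decode PaliGemma-style localization tokens into pixel boxes.
# Assumes each detection is four "<locXXXX>" tokens (y_min, x_min, y_max, x_max
# on a 0-1023 grid) followed by a label, e.g. "<loc0256><loc0128><loc0900><loc0768> cat".
# Multiple detections are assumed to be separated by ";".
_DET_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
)

def decode_detections(text: str, image_width: int, image_height: int):
    """Return a list of (label, (x_min, y_min, x_max, y_max)) in pixel coordinates."""
    detections = []
    for y0, x0, y1, x1, label in _DET_PATTERN.findall(text):
        # Scale the 0-1023 grid coordinates to the actual image size.
        box = (
            int(x0) / 1024 * image_width,
            int(y0) / 1024 * image_height,
            int(x1) / 1024 * image_width,
            int(y1) / 1024 * image_height,
        )
        detections.append((label.strip(), box))
    return detections

# Example: output for the prompt "detect cat" on a 640x480 image.
print(decode_detections("<loc0256><loc0128><loc0900><loc0768> cat", 640, 480))
```

The same idea extends to counting: the number of decoded boxes for a given label is the count.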
Upgraded versions such as PaliGemma 2 and Molmo by Allen AI push performance further, enabling tasks like pointing at individual instances and counting objects. Qwen2.5-VL can detect, point to, and count objects, including UI elements, demonstrating how versatile these models have become.

Multimodal Safety Models
Safety is a crucial aspect of deploying VLMs in production. Google's ShieldGemma 2, an extension of the text-only ShieldGemma, checks images against content policies to help keep outputs safe and compliant. Meta's Llama Guard 4, a dense model pruned from Llama 4 Scout, supports both text-only and multimodal inference, filtering the inputs and outputs of vision language models.

Enhanced Retrieval Augmented Generation (RAG)
Traditionally, processing complex documents such as PDFs involved brittle parsing steps. Multimodal RAG sidesteps this by using document screenshot embedding (DSE) and ColBERT-like models to skip parsing and retrieve relevant pages directly. DSE models pair text and image encoders to produce a single embedding vector per query and per page, while ColBERT-like models such as ColPali capture more nuanced relationships by computing similarities between every text token and every image patch. These advances make document retrieval more accurate and more cost-effective.

Video Language Models
Video poses unique challenges, chiefly the temporal relationships between frames and the sheer volume of data. Models such as Meta's LongVU and Qwen2.5VL downsample and select video frames, refining the selection against the text query, so they can follow the pace and context of real-life events without processing every frame. Gemma 3, by Google DeepMind, accepts video frames interleaved with timestamps, which improves its performance on video understanding tasks.

New Alignment Techniques
Preference optimization, an alternative to fine-tuning on fixed labels, has gained traction in VLM development. The trl library supports direct preference optimization (DPO), in which the model is trained to prefer responses aligned with user preferences over rejected alternatives. Datasets such as RLAIF-V are already formatted for this kind of training, helping models produce accurate, preferred outputs; a hedged sketch of such a setup follows below.
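To make this concrete, below is a minimal sketch of a VLM DPO run with trl. The checkpoint and dataset names are illustrative assumptions rather than a verified recipe, argument names can vary across trl versions, and no hyperparameter tuning is implied; the key idea is that each training row pairs an image and prompt with a chosen and a rejected response.

```python
# Minimal sketch of preference-tuning a VLM with trl's DPOTrainer.
# Assumptions: the model id and dataset name below are illustrative; the dataset
# is expected to provide "images", "prompt", "chosen", and "rejected" columns
# (an RLAIF-V-style preference format). Not a tuned or verified training recipe.
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceM4/idefics2-8b"  # placeholder VLM checkpoint
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Preference pairs: image + prompt with a preferred and a dispreferred answer.
dataset = load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train")

training_args = DPOConfig(
    output_dir="vlm-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                # trl keeps a frozen copy of the policy as reference
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,    # older trl versions take tokenizer=processor instead
)
trainer.train()
```

In practice, parameter-efficient fine-tuning (for example, LoRA via peft) is often combined with DPO to keep memory requirements manageable; it is omitted here for brevity.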
New Benchmarks

MMT-Bench
MMT-Bench evaluates VLMs across a wide range of tasks, with 31,325 multiple-choice visual questions drawn from varied scenarios. It assesses capabilities such as OCR, visual recognition, and visual-language retrieval, providing a comprehensive evaluation tool.

MMMU-Pro
An enhanced version of the original MMMU benchmark, MMMU-Pro tests whether advanced models truly understand content across multiple modalities. With a vision-only input setting and an increased number of candidate options, it mimics real-world conditions and provides a more rigorous assessment of model performance.

Industry Insights and Company Profiles
The rapid advances in VLMs are driven by key players such as Meta, Google DeepMind, Alibaba, and Moonshot AI. Meta's Chameleon and Alibaba's Qwen 2.5 Omni have set standards for multimodal integration and efficiency. Google DeepMind's efforts, particularly Gemma 3 and ShieldGemma 2, highlight the importance of safety and long-context handling. Moonshot AI's Kimi-VL-A3B-Thinking is advancing reasoning capabilities, reflecting the community's focus on models that can handle complex, agentic tasks. Together, these developments point toward more robust, versatile, and efficient VLMs, meeting the growing demand for models that operate seamlessly across modalities and applications.

The ongoing research and innovation in this field promise even more sophisticated and reliable solutions in the near future, opening new horizons for AI in both consumer and industrial settings.