HyperAI

WINGS: New Dual-Learner Architecture Prevents Text-Only Forgetting in Multimodal Language Models


Multimodal large language models (MLLMs) have revolutionized AI by combining the power of text and vision, enabling more interactive and intuitive applications. These models can interpret visuals, answer image-based questions, and engage in mixed text-and-image dialogues, making them invaluable for sectors like education, content creation, and interactive assistance. However, a significant challenge arises when MLLMs are trained on mixed datasets: they often suffer from text-only forgetting. This happens because the model's attention is diverted by visual tokens, degrading performance on purely textual tasks such as reasoning, comprehension, and question answering.

Various strategies have been tried to mitigate this issue, including reintroducing large quantities of text-only data and alternating between text-only and multimodal fine-tuning. Adapter layers and prompt-based tuning have also been used, but these methods can be computationally expensive, complicate inference logic, or only partially restore text understanding. The core problem lies in the model's attention mechanism, which shifts too heavily toward visual content when image tokens are present.

Recognizing this bottleneck, researchers from Alibaba Group’s AI Business team and Nanjing University developed an innovative solution called WINGS (Weight-Informed Guidance System). The architecture introduces two additional modules, a visual learner and a textual learner, into each layer of the MLLM. These learners operate alongside the main attention mechanism, resembling "wings" that extend the model's capabilities, and a routing component manages the allocation of attention between them so that the model can dynamically balance its focus on visual and textual information.

One of the key techniques in WINGS is Low-Rank Residual Attention (LoRRA), which keeps the computational overhead low while still letting the visual and textual learners capture crucial modality-specific details. Training proceeds in two stages: in the first, only the visual learners are active, aligning image features; in the second, both learners are co-trained with a router module that uses attention weights to distribute tasks. Each learner attends to either the image or the surrounding text through efficient attention blocks, and its output is merged with the main model's, preventing visual tasks from overwhelming textual ones. (Rough sketches of such a block and of the two-stage schedule appear below.)

WINGS has shown impressive performance improvements in both text-only and multimodal tasks. On MMLU (Massive Multitask Language Understanding), it scored 60.53, a 9.70-point improvement over a comparable baseline model. On CMMLU, WINGS achieved 69.82, surpassing the baseline by 9.36 points. In reasoning tasks, WINGS saw gains of 11.9 points on RACE-High and 11.12 points on the WSC (Winograd Schema Challenge). Multimodal benchmarks also benefited, with a 4.78-point improvement on MMMU-VAL and stronger performance on the IIT benchmark for interleaved text-and-image dialogues.

The introduction of WINGS represents a significant step forward in addressing the challenge of multimodal training. By carefully managing attention and integrating specialized learners, the architecture maintains high text comprehension while effectively processing visual information. This balance is critical for the practical application of MLLMs, ensuring that they remain versatile and reliable in real-world scenarios.
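To make the dual-learner idea concrete, here is a minimal PyTorch-style sketch of what a WINGS-like block could look like, assuming a low-rank attention learner per modality and a soft router that weights the two learner outputs. All class names, dimensions, and the routing formula are illustrative assumptions drawn from the description above, not the authors' released code.

```python
# Hypothetical sketch of WINGS-style dual learners with low-rank residual
# attention (LoRRA). Names, shapes, and routing are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRRALearner(nn.Module):
    """A low-rank residual attention 'wing' attached beside the main attention."""

    def __init__(self, hidden_dim: int, rank: int = 16):
        super().__init__()
        # Low-rank projections keep the extra parameter and compute cost small.
        self.down_q = nn.Linear(hidden_dim, rank, bias=False)
        self.down_kv = nn.Linear(hidden_dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim) main-branch states; modality_tokens: the
        # visual or textual token states this learner specializes in.
        q = self.down_q(hidden)                        # (B, S, r)
        k = self.down_kv(modality_tokens)              # (B, M, r)
        v = self.down_kv(modality_tokens)              # (B, M, r)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.up(attn @ v)                       # residual signal, (B, S, dim)


class WingsBlock(nn.Module):
    """Merges the main attention output with routed visual/textual learner outputs."""

    def __init__(self, hidden_dim: int, rank: int = 16):
        super().__init__()
        self.visual_learner = LoRRALearner(hidden_dim, rank)
        self.textual_learner = LoRRALearner(hidden_dim, rank)
        self.router = nn.Linear(hidden_dim, 2)  # soft weights over the two wings

    def forward(self, main_attn_out, hidden, visual_tokens, text_tokens):
        weights = F.softmax(self.router(hidden), dim=-1)           # (B, S, 2)
        vis = self.visual_learner(hidden, visual_tokens)
        txt = self.textual_learner(hidden, text_tokens)
        # Learner outputs are added as a routed residual on top of main attention.
        return main_attn_out + weights[..., :1] * vis + weights[..., 1:] * txt
```

In this sketch the router produces per-token weights, so tokens dominated by surrounding text can lean on the textual learner while image-adjacent tokens draw more on the visual learner, which is the balancing behavior the article describes.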
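The two-stage schedule can likewise be sketched as a simple parameter-freezing routine: stage one updates only the visual learners to align image features, and stage two co-trains both learners together with the router. The attribute names refer to the hypothetical `WingsBlock` above and are assumptions, not the paper's implementation.

```python
# Hypothetical two-stage training switch for the WingsBlock sketch above.
def set_stage(model: nn.Module, stage: int) -> None:
    for block in model.modules():
        if not isinstance(block, WingsBlock):
            continue
        if stage == 1:
            # Stage 1: only the visual "wing" learns; text wing and router stay frozen.
            for p in block.visual_learner.parameters():
                p.requires_grad = True
            for p in block.textual_learner.parameters():
                p.requires_grad = False
            for p in block.router.parameters():
                p.requires_grad = False
        else:
            # Stage 2: co-train both learners and the router on mixed data.
            for p in block.parameters():
                p.requires_grad = True
```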
The research underscores the importance of thoughtful design in overcoming the limitations of existing models and paving the way for more generalizable AI systems. Industry insiders and experts have praised WINGS for its innovative approach and practical benefits. The architecture's efficiency and modality-awareness make it a promising solution for developing future MLLMs. Alibaba Group, known for its advancements in AI and machine learning, continues to push the boundaries of multimodal systems. Nanjing University, a leading institution in computer science, has also contributed significantly to this project, highlighting the value of academic-industry collaboration in AI research.
