
Is Extending Modality The Right Path Towards Omni-Modality?

Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
Publication date: 6/9/2025
Abstract

Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
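The abstract does not specify how the modality-specific models are merged. As a point of reference, the sketch below shows one common weight-space merging baseline--uniform parameter averaging of models fine-tuned from the same base LLM--and is only an illustration, not the paper's method.

```python
# Minimal sketch of weight-space model merging: uniformly average the
# parameters of models that were fine-tuned independently from one shared
# base architecture (e.g., a vision-extended and an audio-extended LLM).
from typing import Dict, List
import torch


def merge_state_dicts(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Element-wise average of several state dicts with identical keys and shapes."""
    merged = {}
    for name in state_dicts[0]:
        # Stack the corresponding tensor from every model and take the mean.
        merged[name] = torch.mean(
            torch.stack([sd[name].float() for sd in state_dicts]), dim=0
        )
    return merged


# Hypothetical usage (vision_model and audio_model share the base architecture):
# merged = merge_state_dicts([vision_model.state_dict(), audio_model.state_dict()])
# base_model.load_state_dict(merged)
```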