Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demand for more versatile and efficient AI. However, previous omni-models have explored speech insufficiently, neglecting its integration with other modalities. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long-speech samples, enabling Lyra to handle complex long-speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
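The abstract names a multi-modality LoRA but does not specify its design. As a point of reference, the sketch below (a minimal PyTorch example) illustrates the general idea such a component typically builds on: modality-specific low-rank adapters attached to a frozen pretrained projection, so that only the small adapter matrices are trained. The class names (`LoRALinear`, `MultiModalityLoRA`), the modality set, and the rank/alpha values are illustrative assumptions, not Lyra's actual implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # update starts at zero, preserving the base model
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class MultiModalityLoRA(nn.Module):
    """One LoRA adapter per modality, sharing a single frozen base projection (hypothetical)."""

    def __init__(self, base: nn.Linear, modalities=("vision", "speech"), rank: int = 8):
        super().__init__()
        self.adapters = nn.ModuleDict(
            {m: LoRALinear(base, rank=rank) for m in modalities}
        )

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        return self.adapters[modality](x)


if __name__ == "__main__":
    base = nn.Linear(4096, 4096)              # stands in for a pretrained LLM projection
    mm_lora = MultiModalityLoRA(base)
    speech_tokens = torch.randn(2, 17, 4096)  # (batch, sequence, hidden)
    out = mm_lora(speech_tokens, modality="speech")
    print(out.shape)                          # torch.Size([2, 17, 4096])
```

Because only the low-rank adapter weights receive gradients, this style of design is one plausible way to realize the reduced training cost and data requirements the abstract attributes to strategy (1); how Lyra combines or routes the per-modality adapters is described in the paper itself, not here.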