Chinese Researchers Introduce Stream-Omni: An Advanced LLM for Real-Time Multimodal AI

Researchers from the University of Chinese Academy of Sciences have introduced Stream-Omni, a new large language-vision-speech model designed to overcome the challenges of modality alignment in cross-modal AI systems. Traditional large multimodal models (LMMs) struggle to integrate text, vision, and speech effectively because of inherent representational discrepancies between these modalities.

Current Limitations of Omni-Modal Architectures

Most existing LMMs fall into three categories: vision-oriented, speech-oriented, and omni-modal. Vision-oriented models such as LLaVA use vision encoders to process visual data and combine it with text inputs to generate textual outputs. Speech-oriented models, such as Mini-Omni and LLaMA-Omni, either project speech features into the LLM embedding space or convert speech into discrete units for direct processing by the LLM. Omni-modal models such as VITA-1.5, MiniCPM-o 2.6, and Qwen2.5-Omni aim to handle all modalities, but they typically rely on large-scale tri-modal datasets, which are scarce, and they offer limited flexibility for producing intermediate text results during speech interactions.

Introducing Stream-Omni

Stream-Omni addresses these limitations with a text-centric alignment approach. It builds on an LLM backbone and integrates the vision and speech modalities according to their semantic relationships with text. For vision, it uses sequence-dimension concatenation to align visual and textual representations. For speech, it introduces a Connectionist Temporal Classification (CTC)-based layer-dimension mapping that aligns speech with text more effectively.

Architecture Overview

The architecture of Stream-Omni includes:

- Vision-Text Alignment: A vision encoder and a projection layer extract visual features, which are concatenated with the text sequence along the sequence dimension (sketched in the code example below).
- Speech-Text Alignment: Special speech layers placed at the bottom and top of the LLM backbone enable bidirectional mapping between speech and text, which helps maintain contextual coherence during speech interactions (also sketched below).
- Training Data: The model is trained on a combination of LLaVA datasets for vision-text pairs, LibriSpeech and WenetSpeech for speech-text data, and the newly created InstructOmni dataset, which is generated by converting existing instruction datasets into speech with text-to-speech synthesis.
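As a reading aid, here is a minimal PyTorch-style sketch of the two alignment mechanisms described above. This is not the authors' implementation: all module names, tensor shapes, dimensions, and the vocabulary size are illustrative assumptions based only on the description in this article (sequence-dimension concatenation for vision, and a CTC head in a bottom speech layer for speech-to-text mapping).

```python
import torch
import torch.nn as nn


class VisionTextConcat(nn.Module):
    """Vision-text alignment sketch: project visual features and concatenate
    them with text embeddings along the sequence dimension."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # projection layer

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (batch, num_patches, vision_dim)
        # text_embeds:  (batch, text_len, llm_dim)
        vis_tokens = self.proj(vision_feats)                # (batch, num_patches, llm_dim)
        return torch.cat([vis_tokens, text_embeds], dim=1)  # concat on the sequence dim


class SpeechTextCTC(nn.Module):
    """Speech-text alignment sketch: a speech layer at the bottom of the LLM
    maps speech frames toward the text vocabulary with a CTC objective, so
    each frame is softly aligned to a text token (or a CTC blank)."""
    def __init__(self, speech_dim=1280, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.in_proj = nn.Linear(speech_dim, llm_dim)
        self.speech_layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True)
        self.ctc_head = nn.Linear(llm_dim, vocab_size + 1)  # +1 for the CTC blank
        self.ctc_loss = nn.CTCLoss(blank=vocab_size, zero_infinity=True)

    def forward(self, speech_feats, speech_lens, text_ids, text_lens):
        # speech_feats: (batch, frames, speech_dim); text_ids: (batch, text_len)
        h = self.speech_layer(self.in_proj(speech_feats))   # (batch, frames, llm_dim)
        log_probs = self.ctc_head(h).log_softmax(dim=-1)    # (batch, frames, vocab+1)
        # nn.CTCLoss expects (frames, batch, classes)
        loss = self.ctc_loss(log_probs.transpose(0, 1), text_ids,
                             speech_lens, text_lens)
        return h, loss


# Example with random tensors, just to show the expected shapes:
fuse = VisionTextConcat()
fused = fuse(torch.randn(2, 256, 1024), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 288, 4096])
```

In the full model as described, a top speech layer would also map the LLM's hidden states back to speech for spoken output; the sketch above covers only the bottom, speech-to-text direction.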
Performance and Capabilities

- Visual Understanding: Stream-Omni performs comparably to advanced vision-oriented LMMs and outperforms VITA-1.5, reducing modality interference while maintaining strong visual comprehension.
- Speech Interaction: Despite using far less speech data (23K hours), Stream-Omni demonstrates superior knowledge-based performance compared with discrete speech-unit models such as SpeechGPT and Moshi.
- Vision-Grounded Speech Interaction: On the SpokenVisIT benchmark, which tests real-world visual understanding, Stream-Omni exceeds the performance of VITA-1.5.
- ASR Performance: Stream-Omni achieves high accuracy and fast inference on the LibriSpeech benchmark, highlighting its effectiveness in automatic speech recognition.

Conclusion

Stream-Omni represents a significant advance in omni-modal AI, introducing efficient modality alignment strategies based on semantic relationships. These methods eliminate the need for extensive tri-modal training data, addressing a major bottleneck in the development of multimodal systems. By demonstrating strong performance across diverse domains and modalities, Stream-Omni sets a new standard for real-time cross-modal AI and paves the way for more flexible and powerful future models.

Industry Insights and Company Profiles

Industry experts praise Stream-Omni for its innovative approach to modality alignment, viewing it as a crucial step forward in the development of versatile AI systems capable of handling multiple types of input in real-time applications. The research highlights ongoing efforts to bridge the gap between different AI modalities, which is essential for creating more robust and adaptable AI solutions. The University of Chinese Academy of Sciences is known for its contributions to cutting-edge AI research, and this project further cements its reputation in the field. Stream-Omni's open-source availability on platforms like Hugging Face encourages collaborative development and rapid iteration, fostering a community-driven approach to advancing AI technology.
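For readers who want to experiment, the open-source availability noted above suggests the released checkpoint can be pulled from Hugging Face. The snippet below is a minimal sketch using huggingface_hub; the repository id is a placeholder assumption, so substitute the name from the project's official release page.

```python
from huggingface_hub import snapshot_download

# Placeholder repository id (assumption) -- replace with the official
# Stream-Omni repository name from the project's release page.
REPO_ID = "your-org/stream-omni"

# Download the published weights and configuration files to a local cache.
local_dir = snapshot_download(repo_id=REPO_ID)
print(f"Model files downloaded to: {local_dir}")
```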
