
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
Release Date: 6/18/2025
Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships to text. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
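
To make the two alignment strategies concrete, below is a minimal, illustrative PyTorch sketch of (a) sequence-dimension concatenation for vision-text alignment and (b) a CTC head that maps speech hidden states onto the text vocabulary, which conveys the core idea behind the CTC-based speech-text mapping. This is not the authors' implementation: the module names, hidden sizes, toy vocabulary, and the placement of the CTC head are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256       # shared hidden size of the LLM backbone (assumed)
VOCAB = 1000  # toy text vocabulary; index 0 reserved as the CTC blank (assumed)

# --- Vision-text alignment via sequence-dimension concatenation ---
# Vision features are projected into the LLM embedding space and simply
# prepended to the text token embeddings along the sequence axis.
vision_proj = nn.Linear(512, D)  # 512-dim vision features are an assumption

def concat_vision_text(vision_feats, text_embeds):
    # vision_feats: (B, N_img, 512), text_embeds: (B, N_txt, D)
    return torch.cat([vision_proj(vision_feats), text_embeds], dim=1)

# --- Speech-text alignment via a CTC objective (simplified) ---
# Speech hidden states are mapped onto the text vocabulary and trained with
# CTC, so speech frames align to the same units as text tokens.
ctc_head = nn.Linear(D, VOCAB)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def speech_ctc_loss(speech_hidden, speech_lens, text_ids, text_lens):
    # speech_hidden: (B, T, D) hidden states of the speech stream
    log_probs = F.log_softmax(ctc_head(speech_hidden), dim=-1)  # (B, T, V)
    # nn.CTCLoss expects log-probabilities shaped (T, B, V)
    return ctc_loss(log_probs.transpose(0, 1), text_ids, speech_lens, text_lens)

if __name__ == "__main__":
    B, N_img, N_txt, T = 2, 16, 8, 50
    fused = concat_vision_text(torch.randn(B, N_img, 512), torch.randn(B, N_txt, D))
    print(fused.shape)  # (2, 24, 256): vision and text share one input sequence

    text_ids = torch.randint(1, VOCAB, (B, N_txt))
    loss = speech_ctc_loss(torch.randn(B, T, D), torch.full((B,), T),
                           text_ids, torch.full((B,), N_txt))
    print(loss.item())
```

In the model described by the abstract, this speech-text mapping operates across LLM layers (a layer-dimension mapping), which is what allows intermediate text such as ASR transcriptions and model responses to be emitted during speech interaction; the single head above only illustrates the alignment objective, not that layer-wise placement.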