
OmniVinci: Advancing Joint Visual-Audio-Text Understanding with Innovative Architecture and Data Curation for State-of-the-Art Omni-Modal LLM Performance

OmniVinci is a research initiative by NVIDIA that advances omni-modal large language models through innovations in both architecture and data curation. The model achieves state-of-the-art performance in joint visual, audio, and textual understanding, enabling deeper and more accurate perception across modalities.

The core of OmniVinci lies in three architectural innovations, each sketched below:

1. OmniAlignNet strengthens alignment between vision and audio embeddings by projecting them into a shared omni-modal latent space, improving cross-modal coherence.
2. Temporal Embedding Grouping captures the relative temporal relationships between visual and audio signals, so the model can reason about the timing and sequence of events across modalities.
3. Constrained Rotary Time Embedding encodes absolute temporal information, giving omni-modal representations a precise temporal context.
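To make the alignment idea concrete, here is a minimal sketch of an OmniAlignNet-style module, assuming a CLIP-style symmetric contrastive objective over pooled per-clip embeddings; the class name, dimensions, and loss are illustrative assumptions, not the paper's exact design:

```python
# Hypothetical OmniAlignNet-style alignment sketch; names and the
# contrastive objective are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAlignSketch(nn.Module):
    def __init__(self, vision_dim=1024, audio_dim=768, latent_dim=512):
        super().__init__()
        # Separate projections map each modality into one shared latent space.
        self.vision_proj = nn.Linear(vision_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, vision_emb, audio_emb):
        # L2-normalize so alignment is measured by cosine similarity.
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        # Symmetric contrastive loss: vision/audio embeddings from the
        # same clip should land close together in the shared space.
        logits = self.logit_scale.exp() * v @ a.t()
        targets = torch.arange(v.size(0), device=v.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

# Usage: pooled per-clip embeddings from the vision and audio encoders.
loss = OmniAlignSketch()(torch.randn(8, 1024), torch.randn(8, 768))
```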
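Temporal Embedding Grouping can be pictured as a chronological merge: vision and audio tokens are interleaved by timestamp so the language model sees cross-modal events in the order they occurred. The grouping rule below is an assumption for illustration only:

```python
# Hypothetical temporal-grouping sketch: merge tokens from both
# modalities into one timestamp-ordered sequence.
import torch

def group_by_time(vision_tokens, vision_times, audio_tokens, audio_times):
    """vision_tokens: (Nv, D), audio_tokens: (Na, D); times in seconds."""
    tokens = torch.cat([vision_tokens, audio_tokens], dim=0)
    times = torch.cat([vision_times, audio_times], dim=0)
    order = torch.argsort(times)  # chronological order across modalities
    return tokens[order], times[order]

v = torch.randn(4, 512); vt = torch.tensor([0.0, 1.0, 2.0, 3.0])
a = torch.randn(3, 512); at = torch.tensor([0.5, 1.5, 2.5])
seq, seq_times = group_by_time(v, vt, a, at)  # 7 tokens, time-ordered
```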
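Constrained Rotary Time Embedding can be read as standard RoPE driven by absolute timestamps rather than integer token positions, with the timestamps bounded so rotation angles stay in a fixed range regardless of clip length. The clamping rule here is an assumed stand-in for the paper's actual constraint:

```python
# Hypothetical rotary time-embedding sketch; the clamp is an assumed
# form of the "constraint", not the paper's exact formulation.
import torch

def rotary_time_embed(x, times, t_max=100.0, base=10000.0):
    """x: (N, D) with even D; times: (N,) absolute timestamps in seconds."""
    # Bound timestamps so rotation angles stay in a fixed range
    # for arbitrarily long inputs.
    t = times.clamp(0.0, t_max)
    half = x.size(-1) // 2
    # Standard RoPE frequency spectrum.
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = t[:, None] * freqs[None, :]  # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Pairwise 2-D rotation of feature channels, as in standard RoPE.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

emb = rotary_time_embed(torch.randn(7, 512), torch.linspace(0.0, 3.0, 7))
```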

To support these components, the team built a data curation and synthesis pipeline that generated 24 million single-modal and omni-modal conversations. This high-quality, diverse dataset lets the model learn rich, context-aware interactions between modalities. The research also finds that modalities mutually reinforce one another in both perception and reasoning, yielding more robust and accurate understanding.

At 9 billion parameters, OmniVinci outperforms Qwen2.5-Omni across multiple benchmarks: a 19.05-point gain on DailyOmni (cross-modal understanding), 1.7 on MMAR (audio), and 3.9 on Video-MME (vision). It does so while training on only 0.2 trillion tokens, one-sixth of the 1.2 trillion used by Qwen2.5-Omni, underscoring the efficiency of the proposed architecture and data strategy.

The model also demonstrates strong real-world capabilities. Prompted with a video of Jensen Huang speaking in a modern office, OmniVinci produces a detailed, accurate description of the content, including the speaker's appearance, the setting, the message on a card, and the discussion of AI evolution and the development of the DGX-1 supercomputer. In another scenario, it narrates a complex visual sequence in which a robot receives a gift, interprets gestures, and reacts to a new device, showing a deep understanding of both visual and contextual cues.

OmniVinci's strengths carry over to downstream applications in robotics, medical AI, and smart factory systems, where multi-modal understanding is critical: integrating information from images, video, audio, and text enables more intelligent, adaptive behavior in complex environments.

The research is published on arXiv as "OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM" by Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Y.-C. Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, and Pavlo Molchanov. Researchers are encouraged to cite this work when using it in their own studies.