OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Abstract
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6x reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
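To make the vision-audio alignment idea behind OmniAlignNet concrete, below is a minimal, hypothetical Python sketch of one common way such alignment is done: project both modalities into a shared latent space and train with a symmetric contrastive objective. The class name, projection layers, pooling, and loss here are illustrative assumptions and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OmniAlignSketch(nn.Module):
    """Hypothetical sketch: project vision and audio embeddings into a shared
    latent space and pull matching pairs together with a symmetric contrastive
    loss. Dimensions and objective are assumptions, not the paper's design."""

    def __init__(self, vision_dim: int, audio_dim: int, shared_dim: int = 512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.log_temp = nn.Parameter(torch.zeros(()))  # learnable temperature

    def forward(self, vision_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # Pool per-clip token embeddings into one vector per sample, then project
        # and L2-normalize so similarities are cosine similarities.
        v = F.normalize(self.vision_proj(vision_emb.mean(dim=1)), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb.mean(dim=1)), dim=-1)
        logits = v @ a.t() * self.log_temp.exp()  # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: matching vision/audio pairs lie on the diagonal.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))


# Example usage with random features for a batch of 4 clips.
vision = torch.randn(4, 16, 1024)  # (batch, vision tokens, vision dim)
audio = torch.randn(4, 32, 768)    # (batch, audio tokens, audio dim)
loss = OmniAlignSketch(vision_dim=1024, audio_dim=768)(vision, audio)
```

This sketch covers only the shared-latent-space alignment; the temporal mechanisms named in the abstract (Temporal Embedding Grouping and Constrained Rotary Time Embedding) are separate components described in the full paper.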