OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Abstract

Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a sixfold reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
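The abstract does not spell out how OmniAlignNet aligns the two modalities; as a rough illustration of the general idea, the following is a minimal sketch assuming a CLIP-style symmetric contrastive objective over pooled vision and audio embeddings projected into a shared latent space. The class name, pooling strategy, dimensions, and loss here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (NOT the paper's implementation) of vision-audio
# alignment in a shared omni-modal latent space, assuming a CLIP-style
# symmetric contrastive loss. All names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAlignNetSketch(nn.Module):
    def __init__(self, vision_dim: int, audio_dim: int, shared_dim: int = 512):
        super().__init__()
        # Per-modality projection heads into the shared latent space.
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        # Learnable temperature, as in CLIP-style objectives.
        self.log_tau = nn.Parameter(torch.tensor(0.0))

    def forward(self, vision_tokens, audio_tokens):
        # Mean-pool token sequences to one embedding per clip (an assumption).
        v = F.normalize(self.vision_proj(vision_tokens.mean(dim=1)), dim=-1)
        a = F.normalize(self.audio_proj(audio_tokens.mean(dim=1)), dim=-1)
        logits = (v @ a.t()) / self.log_tau.exp()  # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: match each clip's video to its own audio
        # and vice versa.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss

# Usage with random features: batch of 8 clips, 64 vision / 32 audio tokens.
vision = torch.randn(8, 64, 1024)
audio = torch.randn(8, 32, 768)
print(OmniAlignNetSketch(vision_dim=1024, audio_dim=768)(vision, audio))
```

This kind of objective pulls embeddings of a clip's visual and audio streams together while pushing apart mismatched pairs; the paper's Temporal Embedding Grouping and Constrained Rotary Time Embedding additionally encode relative and absolute timing, which this pooled sketch deliberately ignores.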
