Capybara-OMNI: An Efficient Paradigm for Building OMNI-MODAL Language Models

Xingguang Ji Jiakang Wang Hongzhi Zhang Jingyuan Zhang Haonan Zhou Chenxi Sun Yahui Liu Qi Wang Fuzheng Zhang

Abstract

With the development of Multimodal Large Language Models (MLLMs), numerous outstanding accomplishments have emerged within the open-source community. Due to the complexity of creating and training on multimodal data pairs, building powerful MLLMs remains a computationally expensive and time-consuming process. In this work, we introduce Capybara-OMNI, an MLLM that trains in a lightweight and efficient manner and supports understanding text, image, video, and audio modalities. We present in detail the framework design, the data construction, and the training recipe needed to develop an MLLM step by step and obtain competitive performance. We also provide the exclusive benchmarks used in our experiments to show how to properly verify understanding capabilities across different modalities. Results show that by following our guidance, we can efficiently build an MLLM that achieves competitive performance among models of the same scale on various multimodal benchmarks. Additionally, to enhance the multimodal instruction-following and conversational capabilities of the model, we further discuss how to train the chat version on top of an MLLM understanding model, which is more in line with user habits for tasks like real-time interaction with humans. We publicly disclose the Capybara-OMNI model, along with its chat-based version.

One-sentence Summary

The authors introduce Capybara-OMNI, an efficient paradigm for building omni-modal language models that supports understanding text, image, video, and audio modalities through a lightweight training recipe, framework design, and data construction to achieve competitive performance among models of the same scale on various multimodal benchmarks, while publicly disclosing both the base model and a chat-based version for real-time interaction with humans.

Key Contributions

  • This work introduces Capybara-OMNI, an omni-modal language model that supports text, image, video, and audio understanding through a lightweight and efficient training process.
  • The paper details a framework design and training recipe that enhances instruction-following capabilities, enabling the development of a chat-based version for real-time interaction.
  • Experimental results demonstrate competitive performance on various multimodal benchmarks, and the paper provides the benchmarks used in its experiments so that understanding capabilities can be verified across different modalities.

Introduction

Multimodal Large Language Models are critical for advancing human-computer interaction by enabling systems to process text, images, video, and audio simultaneously. Despite their potential, training these models remains computationally expensive due to the complexity of aligning diverse data modalities. Prior work often suffers from mutual interference between modalities: without careful design, adding audio capabilities substantially degrades visual understanding. The authors introduce Capybara-OMNI to resolve these issues through a lightweight training paradigm that supports full-modality input. Their approach uses optimized data construction and training recipes to achieve competitive performance across modalities while significantly reducing resource requirements.

Method

The authors construct the Capybara-OMNI model based on the Capybara-VL-7B architecture, integrating visual, audio, and text modalities into a unified framework. Refer to the framework diagram for the detailed system structure. The visual module employs a SigLIP vision encoder connected to the Large Language Model (LLM) via a two-layer MLP adapter. To accommodate variable aspect ratios and high-resolution images, the system interpolates position embeddings and segments images into sub-images, reducing visual tokens from 1024 to 256 through 2×2 bilinear interpolation. The audio module utilizes an encoder initialized from Whisper-large-v3, projecting audio features into the LLM space via a single-layer MLP. The core LLM is Qwen2.5-7B, which processes the unified token sequences.

Capybara-OMNI Architecture Overview
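
The token reduction described above can be illustrated with a small sketch. It assumes that the 1024 visual tokens form a square 32×32 grid and that the 2×2 bilinear step samples at the centre of each 2×2 block, in which case it is equivalent to averaging the four tokens in the block. Neither detail is stated in the source, and the names downsample_2x2, SIDE, and FEATURE_DIM are hypothetical.

```python
# Sketch: 2x2 downsampling of a square grid of visual tokens.
# Assumption (not from the source): halving the grid with bilinear
# interpolation at block centres averages each 2x2 block of tokens.

def downsample_2x2(grid):
    """grid[r][c] is a feature vector (list of floats); returns a grid half the size."""
    rows, cols = len(grid), len(grid[0])
    assert rows % 2 == 0 and cols % 2 == 0, "grid sides must be even"
    dim = len(grid[0][0])
    out = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            # Average the four neighbouring tokens, one feature dimension at a time.
            out_row.append([sum(vec[d] for vec in block) / 4.0 for d in range(dim)])
        out.append(out_row)
    return out


SIDE = 32        # assumed grid side: 32 * 32 = 1024 tokens before reduction
FEATURE_DIM = 4  # toy feature size, chosen only to keep the example small

# Toy tokens: every feature of the token at (r, c) holds its flat index.
tokens = [[[float(r * SIDE + c)] * FEATURE_DIM for c in range(SIDE)] for r in range(SIDE)]
reduced = downsample_2x2(tokens)

num_before = SIDE * SIDE
num_after = len(reduced) * len(reduced[0])
print(num_before, "->", num_after)  # 1024 -> 256
```

Each output token summarizes a 2×2 neighbourhood, so the sequence passed to the LLM is four times shorter while the spatial layout of the grid is preserved.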

The training process is partitioned into three phases: visual alignment, audio alignment, and cross-modal instruction tuning. As shown in the figure below, the authors utilize specific freezing strategies in each phase to manage parameter updates effectively.

Three-Phase Training Strategy

In the visual alignment phase, the model acquires image and video understanding capabilities. The training data construction process is illustrated in the figure below. The authors collect approximately 12 million image samples and 2.4 million video samples from diverse sources. This phase consists of three sub-stages. Initially, the LLM and vision encoder are frozen while the adapter is trained for coarse-grained alignment. Subsequent stages unfreeze all parameters to refine fine-grained visual concepts and handle high-resolution inputs split into up to nine sub-images.

Training Data Construction Pipeline
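
Splitting a high-resolution image into at most nine sub-images requires choosing a tile grid. The summary does not say how the authors choose it, so the following is only a sketch of one common approach: pick the grid whose aspect ratio is closest to that of the image. The functions choose_grid and tile_boxes, and the tie-breaking rule that prefers more tiles, are assumptions made for illustration.

```python
import math


def choose_grid(width, height, max_tiles=9):
    """Pick (rows, cols) with rows * cols <= max_tiles whose shape best matches the image.

    Hypothetical helper: the source only states the limit of nine sub-images.
    """
    target = width / height
    best_key, best_grid = None, None
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            # Compare aspect ratios in log space so 2:1 and 1:2 are penalised equally.
            error = abs(math.log((cols / rows) / target))
            # On ties, prefer the grid with more tiles (an arbitrary choice here).
            key = (error, -(rows * cols))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (rows, cols)
    return best_grid


def tile_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) crop boxes covering the image, row by row."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


rows, cols = choose_grid(1600, 400)        # a wide 4:1 image
boxes = tile_boxes(1600, 400, rows, cols)
print((rows, cols), len(boxes), boxes[0])
```

A real pipeline would usually also take the image resolution into account, so that small images are not split into more tiles than they need.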

During the audio alignment phase, the focus shifts to speech comprehension while preserving visual performance. Following the Freeze-Omni strategy, the LLM and visual components remain frozen. Only the audio encoder and adapter are trained on approximately 1.4 million ASR and S2TT data instances. This approach prevents catastrophic forgetting of the visual capabilities learned in the previous stage.

Finally, the cross-modal instruction tuning phase integrates all modalities for complex interaction. The authors generate synthetic data using GPT-4o and convert text-based VQA data into audio using Text-to-Speech (TTS). In this final stage, all model parameters are activated and updated to optimize cross-modal understanding and dialogue capabilities.
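
The freezing strategy across the three phases can be summarized as a small schedule. This is a sketch based only on the description above. The module names are hypothetical, and it assumes that "all parameters" in the later visual sub-stages means the vision encoder, the vision adapter, and the LLM, since the audio modules are only trained in the following phase.

```python
from dataclasses import dataclass

# Hypothetical module names; the source does not name the components this way.
MODULES = ("vision_encoder", "vision_adapter", "audio_encoder", "audio_adapter", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules that receive gradient updates in this stage


SCHEDULE = [
    # Visual alignment, first sub-stage: only the adapter is trained.
    Stage("visual alignment (coarse)", frozenset({"vision_adapter"})),
    # Later visual sub-stages: assumed to update every visual-path parameter.
    Stage("visual alignment (fine)", frozenset({"vision_encoder", "vision_adapter", "llm"})),
    # Audio alignment: the LLM and visual components stay frozen.
    Stage("audio alignment", frozenset({"audio_encoder", "audio_adapter"})),
    # Cross-modal instruction tuning: everything is updated.
    Stage("cross-modal instruction tuning", frozenset(MODULES)),
]


def frozen_modules(stage):
    """Modules that stay fixed during the given stage, in alphabetical order."""
    return sorted(set(MODULES) - stage.trainable)


for stage in SCHEDULE:
    print(f"{stage.name}: train={sorted(stage.trainable)} frozen={frozen_modules(stage)}")
```

In a training framework, a table like this would typically drive which parameters have gradients enabled before each phase starts.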

Experiment

Capybara-OMNI's multimodal understanding is evaluated across image, video, and audio tasks using standard open-source benchmarks to ensure realistic comparisons. The model demonstrates competitive performance in visual domains, frequently surpassing larger open-source models and rivaling closed-source systems, while video results underscore the efficacy of its training strategies in retaining capabilities. Additionally, audio experiments confirm strong performance against specialized models, with ablation studies validating that high-quality encoder initialization and data screening significantly improve audio understanding.

The table compares Capybara-OMNI against various private and open-source models across multiple video understanding benchmarks. Results indicate that the proposed model achieves competitive performance, frequently surpassing similarly sized open-source models and previous omni architectures. Capybara-OMNI outperforms comparable open-source models on Video-MME and MVBench, and it scores higher on PerceptionTest than larger 72B models. It also exceeds previous omni models such as VITA-1.5 across all listed metrics.

Capybara-OMNI video benchmark performance results

The table compares Capybara-OMNI against various private and open-source vision-language models across eight diverse benchmarks. Results indicate that the proposed model achieves a competitive average score, notably surpassing several larger open-source models and closed-source baselines. It is particularly strong in scientific diagram understanding and mathematical reasoning, often matching the performance of models with significantly more parameters. Capybara-OMNI outperforms GPT4o-mini and LLaVA-OV-72B in overall average score, achieves top-tier results on the AI2D and MathVista benchmarks, and remains competitive against much larger 72B and 76B parameter models.

Capybara-OMNI image understanding benchmark results

The table evaluates audio understanding capabilities across Chinese and English ASR tasks and speech-to-text translation. Capybara-OMNI demonstrates competitive performance, particularly in Chinese ASR, where it outperforms specialized models like GLM-4-Voice and achieves top-tier results on the Aishell-1 benchmark. While it slightly trails the initialization model Qwen2-Audio, it surpasses other open-source omni models in several English benchmarks, outperforming omni competitors like VITA-1.5 on the LibriSpeech tasks. Results also indicate strong competitiveness in speech-to-text translation compared to open-source alternatives.

Capybara-OMNI audio understanding evaluation results

The table illustrates the progressive improvement of audio understanding through specific architectural and data modifications. Initializing the model with the Qwen2-Audio encoder yields substantial performance gains over the baseline. Applying data augmentation brings further improvements, lowering error rates across the tested benchmarks. The final configuration achieves the best performance on both ASR and translation tasks.

Ablation study on audio understanding components
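
The error rates discussed for the ASR results are not defined in this summary. ASR benchmarks are customarily scored with word error rate (WER) for English and character error rate (CER) for Chinese, so the sketch below assumes those metrics. The function names are hypothetical, and speech-to-text translation is normally scored with a different metric that is not covered here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER when unit == 'word', CER (ignoring spaces) when unit == 'char'."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference.replace(" ", "")), list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = error_rate("the cat sat on the mat", "the cat sit on mat")
# One wrong character out of six.
cer = error_rate("今天天气很好", "今天天汽很好", unit="char")
print(f"WER={wer:.3f} CER={cer:.3f}")  # WER=0.333 CER=0.167
```

Lower values are better for both metrics, which is why the ablation is described in terms of lowering error rates.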

The experiments evaluate Capybara-OMNI across video, image, and audio understanding benchmarks against a range of private and open-source baselines. Results indicate the model achieves competitive performance in video and image tasks, often surpassing similarly sized open-source architectures and matching larger models in scientific and mathematical reasoning. Audio assessments show strong capabilities in speech recognition and translation, while ablation studies validate that initializing with a specialized encoder and applying data augmentation significantly enhance these features.

