HyperAIHyperAI

Command Palette

Search for a command to run...

Capybara-OMNI : un paradigme efficace pour la construction de modèles de langage OMNI-MODAL

Xingguang Ji Jiakang Wang Hongzhi Zhang Jingyuan Zhang Haonan Zhou Chenxi Sun Yahui Liu Qi Wang Fuzheng Zhang

Déploiement en un clic du modèle de création visuelle unifié Capybara

Aller à Notebook

Résumé

Avec l'évolution des Multimodal Large Language Models (MLLMs), de nombreuses réalisations remarquables ont émergé au sein de la communauté open-source. Toutefois, en raison de la complexité inhérente à la création et à l'entraînement de paires de données multimodales, la construction de modèles MLM puissants demeure un processus exigeant en termes de ressources computationnelles et chronophage. Dans ce travail, nous présentons Capybara-OMNI, un MLM conçu pour s'entraîner de manière légère et efficace, tout en prenant en charge la compréhension des modalités textuelles, visuelles, vidéo et audio. Nous détaillons la conception de l'architecture, la construction des jeux de données ainsi que la recette d'entraînement, afin de développer un MLM de manière itérative et d'obtenir des performances compétitives. Nous fournissons également un benchmark exclusif utilisé dans nos expériences, permettant de vérifier correctement les capacités de compréhension à travers différentes modalités. Les résultats démontrent que, en suivant notre méthodologie, il est possible de construire efficacement un MLM atteignant des performances compétitives par rapport à des modèles de même échelle sur divers benchmarks multimodaux. Par ailleurs, afin d'améliorer les capacités du modèle en matière de suivi d'instructions multimodales et de conversation, nous discutons également des stratégies d'entraînement d'une version chat basée sur un modèle de compréhension MLM, mieux adaptée aux habitudes des utilisateurs pour des tâches telles que l'interaction en temps réel avec des humains. Nous rendons publiquement disponibles le modèle Capybara-OMNI ainsi que sa version conversationnelle.

One-sentence Summary

The authors introduce Capybara-OMNI, an efficient paradigm for building omni-modal language models that supports understanding text, image, video, and audio modalities through a lightweight training recipe, framework design, and data construction to achieve competitive performance among models of the same scale on various multimodal benchmarks, while publicly disclosing both the base model and a chat-based version for real-time interaction with humans.

Key Contributions

  • This work introduces Capybara-OMNI, an omni-modal language model that supports text, image, video, and audio understanding through a lightweight and efficient training process.
  • The paper details a framework design and training recipe that enhances instruction-following capabilities, enabling the development of a chat-based version for real-time interaction.
  • Experimental results demonstrate competitive performance on various multimodal benchmarks, utilizing an exclusive benchmark provided to verify understanding capabilities across different modalities.

Introduction

Multimodal Large Language Models are critical for advancing human-computer interaction by enabling systems to process text, images, video, and audio simultaneously. Despite their potential, training these models remains computationally expensive due to the complexity of aligning diverse data modalities. Prior work often faces challenges with mutual interference where adding audio capabilities substantially degrades visual understanding without careful design. The authors introduce Capybara-OMNI to resolve these issues through a lightweight training paradigm that supports full-modality input. Their approach utilizes optimized data construction and training recipes to achieve competitive performance across modalities while significantly reducing resource requirements.

Method

The authors construct the Capybara-OMNI model based on the Capybara-VL-7B architecture, integrating visual, audio, and text modalities into a unified framework. Refer to the framework diagram for the detailed system structure. The visual module employs a SigLIP vision encoder connected to the Large Language Model (LLM) via a two-layer MLP adapter. To accommodate variable aspect ratios and high-resolution images, the system interpolates position embeddings and segments images into sub-images, reducing visual tokens from 1024 to 256 through 2×22\times22×2 bilinear interpolation. The audio module utilizes an encoder initialized from Whisper-large-v3, projecting audio features into the LLM space via a single-layer MLP. The core LLM is Qwen2.5-7B, which processes the unified token sequences.

Capybara-OMNI Architecture Overview
Capybara-OMNI Architecture Overview

The training process is partitioned into three phases: visual alignment, audio alignment, and cross-modal instruction tuning. As shown in the figure below, the authors utilize specific freezing strategies in each phase to manage parameter updates effectively.

Three-Phase Training Strategy
Three-Phase Training Strategy

In the visual alignment phase, the model acquires image and video understanding capabilities. The training data construction process is illustrated in the figure below. The authors collect approximately 12 million image samples and 2.4 million video samples from diverse sources. This phase consists of three sub-stages. Initially, the LLM and vision encoder are frozen while the adapter is trained for coarse-grained alignment. Subsequent stages unfreeze all parameters to refine fine-grained visual concepts and handle high-resolution inputs split into up to nine sub-images.

Training Data Construction Pipeline
Training Data Construction Pipeline

During the audio alignment phase, the focus shifts to speech comprehension while preserving visual performance. Following the Freeze-Omni strategy, the LLM and visual components remain frozen. Only the audio encoder and adapter are trained on approximately 1.4 million ASR and S2TT data instances. This approach prevents catastrophic forgetting of the visual capabilities learned in the previous stage.

Finally, the cross-modal instruction tuning phase integrates all modalities for complex interaction. The authors generate synthetic data using GPT-4o and convert text-based VQA data into audio using Text-to-Speech (TTS). In this final stage, all model parameters are activated and updated to optimize cross-modal understanding and dialogue capabilities.

Experiment

Capybara-OMNI's multimodal understanding is evaluated across image, video, and audio tasks using standard open-source benchmarks to ensure realistic comparisons. The model demonstrates competitive performance in visual domains, frequently surpassing larger open-source models and rivaling closed-source systems, while video results underscore the efficacy of its training strategies in retaining capabilities. Additionally, audio experiments confirm strong performance against specialized models, with ablation studies validating that high-quality encoder initialization and data screening significantly improve audio understanding.

The the the table compares Capybara-OMNI against various private and open-source models across multiple video understanding benchmarks. Results indicate that the proposed model achieves competitive performance, frequently surpassing similarly sized open-source models and previous omni architectures. Capybara-OMNI outperforms comparable open-source models on Video-MME and MVbench. The model achieves superior scores on PerceptionTest compared to larger 72B models. Performance exceeds previous omni models like VITA1.5 across all listed metrics.

Capybara-OMNI video benchmark performance results
Capybara-OMNI video benchmark performance results

The the the table compares Capybara-OMNI against various private and open-source vision-language models across eight diverse benchmarks. Results indicate that the proposed model achieves competitive average performance, notably surpassing several larger open-source models and closed-source baselines. It demonstrates particular strength in scientific diagram understanding and mathematical reasoning, often matching the performance of significantly larger parameter models. Capybara-OMNI outperforms GPT4o-mini and LLaVA-OV-72B in overall average score. The model achieves top-tier results on AI2D and MathVista benchmarks. Performance remains competitive against much larger 72B and 76B parameter models.

Capybara-OMNI image understanding benchmark results
Capybara-OMNI image understanding benchmark results

The the the table evaluates audio understanding capabilities across Chinese and English ASR tasks and speech-to-text translation. Capybara-OMNI demonstrates competitive performance, particularly in Chinese ASR where it outperforms specialized models like GLM-4-Voice. While slightly trailing behind the initialization model Qwen2-Audio, it surpasses other open-source omni models in several English benchmarks. Capybara-OMNI achieves top-tier performance on the Aishell-1 Chinese ASR benchmark. The model outperforms other omni competitors like VITA-1.5 on English LibriSpeech tasks. Results indicate strong competitiveness in speech-to-text translation compared to open-source alternatives.

Capybara-OMNI audio understanding evaluation results
Capybara-OMNI audio understanding evaluation results

The the the table illustrates the progressive enhancement of audio understanding capabilities through specific architectural and data modifications. Initializing the model with the Qwen2-Audio encoder yields substantial performance gains over the baseline. Further improvements are observed after applying data augmentation, which lowers error rates across multiple benchmarks. Initializing with Qwen2-Audio encoder yields substantial improvements over the baseline. Data augmentation further enhances audio capabilities across all tested benchmarks. The final configuration achieves the best performance on both ASR and translation tasks.

Ablation study on audio understanding components
Ablation study on audio understanding components

The experiments evaluate Capybara-OMNI across video, image, and audio understanding benchmarks against a range of private and open-source baselines. Results indicate the model achieves competitive performance in video and image tasks, often surpassing similarly sized open-source architectures and matching larger models in scientific and mathematical reasoning. Audio assessments show strong capabilities in speech recognition and translation, while ablation studies validate that initializing with a specialized encoder and applying data augmentation significantly enhance these features.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp