
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Overview

This paper presents VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is to be "vision-centric," which has two aspects: a vision-centric training paradigm and a vision-centric framework design. The key insight behind the vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding; rather than assembling massive video-text datasets, the work focuses on building large-scale, high-quality image-text datasets. Training proceeds in four stages: 1) Vision Encoder Adaptation, which adapts the vision encoder to handle images of variable resolution; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and large language model (LLM) on a large-scale corpus spanning multiple kinds of image-text data (scene images, documents, charts, and so on) as well as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT (supervised fine-tuning) data for downstream tasks and introduces video-text data to lay the groundwork for video understanding; and 4) Video-centric Fine-tuning, which further strengthens the model's video understanding. On the framework side, to capture fine-grained details in images more faithfully, the pre-trained vision encoder is adapted to map an input image to a number of vision tokens that depends on its size, rather than to a fixed number of tokens. For video inputs, the number of vision tokens is reduced according to the similarity between frames, yielding a video representation that is both precise and compact. With this vision-centric design, VideoLLaMA3 achieves very strong performance on image and video understanding benchmarks.

One-sentence Summary

The authors from DAMO Academy and Hupan Lab propose VideoLLaMA3, a vision-centric multimodal foundation model that achieves state-of-the-art performance in both image and video understanding by leveraging large-scale high-quality image-text data and a four-stage training paradigm. The model introduces Any-resolution Vision Tokenization and Differential Frame Pruning to enable flexible, high-fidelity visual representation and efficient video processing, significantly improving capabilities in document comprehension, mathematical reasoning, and long-form video analysis.

Key Contributions

  • VideoLLaMA3 introduces a vision-centric training paradigm that prioritizes image understanding to enhance video comprehension, leveraging high-quality image-text data to improve the vision encoder's robustness before focusing on temporal modeling in video tasks.
  • The model features two key vision-centric framework innovations: dynamic resolution input support via Rotary Position Embedding to handle variable aspect ratios and high-resolution images, and video token compression to reduce redundancy and improve computational efficiency.
  • VideoLLaMA3 achieves state-of-the-art performance on diverse benchmarks, excelling in both image understanding (e.g., chart and math reasoning) and video tasks (e.g., long-form video, temporal grounding), outperforming prior models across multiple metrics.

Introduction

The authors leverage the success of image-centric multimodal large language models (MLLMs) to address the challenges of video understanding, where temporal dynamics and low-quality, sparse video-text datasets hinder progress. Prior work often struggles with inefficient token handling, rigid input representations, and limited generalization due to reliance on scarce, noisy video data. To overcome these limitations, the authors introduce VideoLLaMA3, a vision-centric MLLM that first strengthens image understanding through a four-stage training paradigm—vision encoder adaptation, vision-language alignment, multi-task fine-tuning, and video-centric fine-tuning—using high-quality image-text data. The model incorporates two key technical innovations: dynamic resolution input via Rotary Position Embedding to handle variable aspect ratios and high-resolution images, and video token compression to reduce redundancy and improve computational efficiency. These design choices enable VideoLLaMA3 to achieve state-of-the-art performance on both image and video benchmarks, including document comprehension, mathematical reasoning, long video understanding, and temporal grounding, while maintaining strong generalization across modalities.

Dataset

  • The VL3-Syn7M dataset, used to train VideoLLaMA3, consists of 7 million image-caption pairs sourced from COYO-700M and processed through a multi-stage cleaning pipeline.
  • Key filtering steps include: aspect ratio filtering to remove extreme image shapes, aesthetic score filtering using a dedicated model to discard low-quality visuals, text-image similarity scoring via BLIP2 and CLIP to retain describable content, and visual feature clustering using CLIP features and k-NN to ensure semantic diversity and balanced category coverage.
  • After filtering, images undergo re-captioning: brief captions are generated with InternVL2-8B, and detailed captions with InternVL2-26B, producing two distinct subsets—VL3-Syn7M-short and VL3-Syn7M-detailed—used at different training stages.
  • The dataset is integrated into a multi-stage training framework:
    • In Vision Encoder Adaptation, VL3-Syn7M-short is combined with LLaVA-Pretrain-558K, Object365, and SA-1B to enhance scene understanding and fine-grained feature extraction.
    • In Vision-Language Alignment, VL3-Syn7M is augmented with COCO-2017, ShareGPT4o, ShareGPT4V, DenseFusion, and LLaVA-Recap, with recaptioning applied to enhance caption quality.
    • In Multi-task Fine-tuning, VL3-Syn7M supports general image, document, OCR, grounding, and multi-image tasks, with additional data from Pixmo, Cambrian-10M, and specialized datasets like Demon-Full and Contrastive-Caption.
    • In Video-centric Fine-tuning, VL3-Syn7M contributes to general image understanding, while video-specific data from LLaVA-Video, ShareGPT-4o, and synthetic dense captions from Panda-70M (via Qwen2-VL-72B) are used to strengthen temporal and spatial reasoning.
  • Data is formatted as token sequences: images use “\n” to separate tokens, videos include “Time: xxs” timestamps before each frame and commas to separate frames, and streaming videos use interleaved frame, timestamp, and answer tokens (e.g., “GPT: xxx”) to simulate real-time interaction (a formatting sketch follows this list).
  • For video training, long videos are segmented into two-minute clips based on dense caption intervals, and synthetic streaming conversations are constructed to support multi-turn understanding.
  • Temporal grounding data is converted into text format (e.g., “1.0-2.0 s”) and combined with datasets like ActivityNet, YouCook2, and Charades-STA to train the model on precise event localization.
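
To make the formatting described above concrete, the following is a minimal sketch of how such token sequences might be assembled. The helper names, the "<image>" placeholder, and the exact separator strings are assumptions based on the description in this section ("Time: xxs" before each frame, commas between frames, "1.0-2.0 s" spans for temporal grounding); this is not the authors' released preprocessing code.

```python
# Hypothetical sketch of the token-sequence formatting described above.
# "<image>" stands in for the vision-token placeholder; exact strings are assumptions.

def format_image(num_vision_tokens: int) -> str:
    # Images: vision tokens followed by a newline separator.
    return "<image>" * num_vision_tokens + "\n"

def format_video(frame_times_s: list[float], tokens_per_frame: int) -> str:
    # Videos: each frame is preceded by its timestamp ("Time: xxs"),
    # and frames are separated by commas.
    frames = [
        f"Time: {t:.1f}s" + "<image>" * tokens_per_frame for t in frame_times_s
    ]
    return ",".join(frames)

def format_grounding_span(start_s: float, end_s: float) -> str:
    # Temporal grounding targets are rendered as plain-text spans, e.g. "1.0-2.0 s".
    return f"{start_s:.1f}-{end_s:.1f} s"

if __name__ == "__main__":
    print(format_video([0.0, 0.5, 1.0], tokens_per_frame=4))
    print(format_grounding_span(1.0, 2.0))
```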

Method

The authors leverage a vision-centric approach to design VideoLLaMA3, a multimodal foundation model for image and video understanding. The core of the model architecture consists of a vision encoder, a video compressor, a projector, and a large language model (LLM). The vision encoder, initialized with the pre-trained SigLIP, extracts visual features, while the projector bridges the representation gap between the vision encoder and the LLM. The LLM used is based on the Qwen2.5 architecture. To handle inputs of varying resolutions, the model employs Any-resolution Vision Tokenization (AVT), which adapts the vision encoder to process images and videos of any size by dynamically generating a corresponding number of vision tokens. This is achieved by replacing the absolute position embeddings in the Vision Transformer (ViT) with 2D-RoPE, enabling the encoder to maintain spatial relationships across different resolutions. For video inputs, the model further reduces the number of tokens through a video compressor, which is designed to eliminate redundant information.
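
The key idea of Any-resolution Vision Tokenization can be illustrated with a short sketch: the token count scales with input size, and the two spatial axes share the head dimension through 2D rotary position embeddings. The patch size (14), rotary base (10000), pairing convention, and function names below are assumptions for illustration only, not the authors' exact ViT implementation.

```python
# Minimal sketch of any-resolution vision tokenization with 2D rotary position
# embeddings (2D-RoPE). Patch size, rotary base, and pairing convention are assumptions.
import torch

def num_vision_tokens(height: int, width: int, patch: int = 14) -> int:
    # Token count scales with the input size instead of being fixed.
    return (height // patch) * (width // patch)

def rope_1d(positions: torch.Tensor, dim: int, base: float = 10000.0):
    # Standard rotary angles for one spatial axis; dim must be even.
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    angles = positions[:, None].float() * inv_freq[None, :]  # (N, dim/2)
    return angles.cos(), angles.sin()

def rope_2d(h_tokens: int, w_tokens: int, head_dim: int):
    # Split the head dimension: half encodes the row index, half the column index.
    ys, xs = torch.meshgrid(
        torch.arange(h_tokens), torch.arange(w_tokens), indexing="ij"
    )
    cos_y, sin_y = rope_1d(ys.flatten(), head_dim // 2)
    cos_x, sin_x = rope_1d(xs.flatten(), head_dim // 2)
    cos = torch.cat([cos_y, cos_x], dim=-1)  # (N, head_dim/2)
    sin = torch.cat([sin_y, sin_x], dim=-1)
    return cos, sin  # applied to queries/keys inside the ViT attention layers

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: (N, head_dim); rotate channel pairs by the per-position angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the position information is relative rather than tied to a fixed learned grid, the same encoder weights can process a 336×336 thumbnail or a tall document page, emitting proportionally more tokens for the larger input.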

As shown in the figure below, the model's architecture supports dynamic resolution input, where images of different dimensions are processed into vision tokens of variable lengths. This flexibility is crucial for preserving fine-grained details in images. For videos, the model first applies a per-frame 2×2 spatial downsampling via bilinear interpolation to limit the context length. To further reduce redundancy, the Differential Frame Pruner (DiffFP) is employed. This component computes the 1-norm distance between temporally consecutive patches in pixel space and prunes patches whose distance falls below a threshold, effectively removing regions with minimal content change between frames. The result is a more compact yet precise representation of the video input.
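
A minimal sketch of this pruning rule is shown below. The patch size, the threshold value, and the use of a mean absolute pixel difference (a size-normalized 1-norm) are assumptions made for illustration; the text above only specifies that patches whose pixel-space distance to the co-located patch in the previous frame falls below a threshold are dropped.

```python
# Hypothetical sketch of differential frame pruning as described above:
# drop a patch if it barely changed (small mean L1 distance) relative to the
# same patch location in the previous frame. The threshold value is an assumption.
import torch

def diff_frame_prune(frames: torch.Tensor, patch: int = 14, tau: float = 0.1):
    """frames: (T, C, H, W) pixel tensor in [0, 1], with H and W multiples of `patch`.
    Returns a boolean keep-mask of shape (T, H//patch, W//patch);
    the first frame is always fully kept."""
    T, C, H, W = frames.shape
    # Reshape into per-patch blocks: (T, C, H/p, p, W/p, p) -> (T, H/p, W/p, C*p*p)
    patches = (
        frames.reshape(T, C, H // patch, patch, W // patch, patch)
        .permute(0, 2, 4, 1, 3, 5)
        .reshape(T, H // patch, W // patch, -1)
    )
    keep = torch.ones(T, H // patch, W // patch, dtype=torch.bool)
    # Mean absolute pixel difference against the previous frame's co-located patch.
    l1 = (patches[1:] - patches[:-1]).abs().mean(dim=-1)  # (T-1, H/p, W/p)
    keep[1:] = l1 > tau
    return keep
```

Only the surviving patches are passed on as vision tokens, so static backgrounds contribute little to the LLM context while moving content is preserved.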

The training of VideoLLaMA3 is structured into four distinct stages. The first stage, Vision Encoder Adaptation, fine-tunes the vision encoder and projector on a large-scale image dataset, transforming the encoder into a dynamic-resolution processor. The second stage, Vision-Language Alignment, jointly fine-tunes the vision encoder, projector, and LLM using a diverse set of image-text and text-only data to integrate multimodal knowledge. The third stage, Multi-task Fine-tuning, performs instruction fine-tuning on a combination of image and video-based question-answering data, which enhances the model's ability to follow instructions and lays the foundation for video understanding. The final stage, Video-centric Fine-tuning, focuses on improving video understanding by training on video-text data, image-only data, and text-only data, with all model parameters unfrozen. This staged training process ensures that the model develops strong image understanding capabilities first, which are then leveraged to enhance its video understanding performance.
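
The staged schedule can be summarized as a small configuration sketch. The trainable-module lists follow the description above where it is explicit (vision encoder and projector in stage 1, encoder/projector/LLM from stage 2, everything unfrozen in stage 4); the stage 3 setting and the condensed data labels are assumptions, and hyperparameters are omitted because they are not given here.

```python
# Configuration sketch of the four-stage training schedule described above.
# Stage 3's trainable set is an assumption; the other settings follow the text.
TRAINING_STAGES = [
    {
        "stage": 1,
        "name": "vision_encoder_adaptation",
        "data": ["scene/document image-text pairs (short captions)"],
        "trainable": ["vision_encoder", "projector"],
    },
    {
        "stage": 2,
        "name": "vision_language_alignment",
        "data": ["scene images", "documents", "charts", "text-only"],
        "trainable": ["vision_encoder", "projector", "llm"],
    },
    {
        "stage": 3,
        "name": "multi_task_fine_tuning",
        "data": ["image QA / instruction data", "video-text data"],
        "trainable": ["vision_encoder", "projector", "llm"],  # assumed
    },
    {
        "stage": 4,
        "name": "video_centric_fine_tuning",
        "data": ["video-text", "image-only", "text-only"],
        "trainable": ["all parameters"],  # explicitly unfrozen per the text
    },
]
```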

Experiment

  • VideoLLaMA3-2B and VideoLLaMA3-7B are evaluated on image and video benchmarks to validate their multi-modal understanding capabilities.
  • On image benchmarks, VideoLLaMA3-2B achieves 69.4% on InfoVQA, surpassing the previous best by 3.9%, and 59.2% on MathVista, outperforming prior methods by 7.9%. It also scores 67.3% on RealWorldQA, exceeding the prior state-of-the-art by 4.4%.
  • VideoLLaMA3-7B achieves 65.7% on MathVision, surpassing the previous best by 6.5%, and 67.3% on RealWorldQA, improving by 2.0% over prior models.
  • On video benchmarks, VideoLLaMA3-2B achieves the highest scores on VideoMME w/o sub (59.6%), VideoMME w/ sub (63.4%), ActivityNet-QA (58.2%), PerceptionTest-test (68.0%), and MVBench (65.5%), and leads on all long-video benchmarks: MLVU-dev (65.4%), LongVideoBench-val (57.1%), and LVBench (40.4%).
  • VideoLLaMA3-7B leads on 5 out of 7 general video understanding benchmarks and achieves top performance on MLVU-dev, with strong results on LongVideoBench-val and LVBench.
  • Ablation studies confirm SigLIP as the optimal vision encoder, outperforming CLIP and DFN, especially in text-rich and fine-grained understanding tasks.

The authors use VideoLLaMA3-7B to evaluate its performance on video benchmarks, comparing it against several state-of-the-art models. Results show that VideoLLaMA3-7B achieves the highest scores on multiple tasks, including MLVU-dev (73.0%), LongVideoBench-val (59.8%), and NextQA (84.5%), demonstrating strong capabilities in long video understanding and temporal reasoning.

Results show that VideoLLaMA3 achieves the highest scores on most benchmarks compared to other models. It outperforms baselines such as Qwen2-VL, LLaVA-Video, and InternVL2.5 across general video understanding, long video comprehension, and temporal reasoning tasks, demonstrating strong performance in both short and long video analysis.

The authors conduct an ablation study comparing three vision encoders—CLIP, DFN, and SigLIP—on a subset of the dataset, evaluating their performance across multiple benchmarks. Results show that SigLIP outperforms the other two encoders, particularly in fine-grained text understanding tasks, leading the authors to select it as the base vision encoder for further development.

The authors use the table to compare the performance of VideoLLaMA3-2B against several baselines on video understanding benchmarks. Results show that VideoLLaMA3-2B achieves the highest scores on multiple tasks, including VideoMME w/o sub (59.6%), VideoMME w/ sub (63.4%), and PerceptionTest-test (68.0%). It also leads in long video understanding and temporal reasoning, scoring 65.4% on MLVU-dev and 81.1% on NextQA, demonstrating strong performance across diverse video comprehension tasks.

The authors use Table 5 to evaluate the 2B model variants on image benchmarks, focusing on document/chart/scene text understanding, mathematical reasoning, multi-image understanding, and general knowledge QA. Results show that VideoLLaMA3-2B achieves the highest scores on several key tasks, including DocVQA test (91.9), OCRBench (779), and RealWorldQA (67.3), outperforming all listed baselines. It also demonstrates strong performance in mathematical reasoning, scoring 59.2 on MathVista, which is significantly higher than the second-best model.

