OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.