OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.