VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Yu, Jiashuo; Wu, Yue; Chu, Meng; Ren, Zhifei; Huang, Zizheng; Chu, Pei; Zhang, Ruijie; He, Yinan; Li, Qirui; Li, Songze; Li, Zhenxiang; Tu, Zhongying; He, Conghui; Qiao, Yu; Wang, Yali; Wang, Yi; Wang, Limin
Publication date: 6/15/2025
Abstract

We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process, including expert inter-rater review, to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps and spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. In addition to multiple-choice questions (MCQs) for the final answers, we propose a progress-level LLM-guided scoring metric that comprehensively evaluates the quality of the reasoning chain along multiple dimensions. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
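
To make the progress-level scoring idea concrete, the sketch below shows one way such a metric could be structured: each timestamped reasoning step is scored by an LLM judge along several dimensions, and per-dimension averages summarize the chain. This is a minimal illustration, not the authors' released pipeline; the dimension names, the `ReasoningStep` type, and the `judge` interface are all assumptions for exposition.

```python
# Hypothetical sketch of a progress-level, LLM-guided scoring metric for
# reasoning chains. Dimension names and the judge interface are assumptions,
# not VRBench's actual implementation.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Assumed evaluation dimensions; the paper only says "multiple dimensions".
DIMENSIONS = ["factual_grounding", "temporal_alignment", "logical_coherence"]


@dataclass
class ReasoningStep:
    text: str         # model-generated reasoning step
    timestamp: float  # seconds into the video that the step refers to


def score_chain(
    steps: list[ReasoningStep],
    judge: Callable[[str, str], float],  # (step_text, dimension) -> score in [0, 1]
) -> dict[str, float]:
    """Score every step on every dimension, then average per dimension."""
    per_dim = {d: [judge(s.text, d) for s in steps] for d in DIMENSIONS}
    return {d: mean(scores) for d, scores in per_dim.items()}


if __name__ == "__main__":
    # Stand-in judge: a real pipeline would prompt an LLM with a scoring rubric.
    dummy_judge = lambda text, dim: 0.8
    chain = [
        ReasoningStep("The protagonist hides the letter in the desk.", 312.0),
        ReasoningStep("Two scenes later, the detective finds it there.", 2840.5),
    ]
    print(score_chain(chain, dummy_judge))
```

Averaging per dimension (rather than collapsing to a single scalar) preserves the multi-dimensional, process-level character of the evaluation that the abstract describes; any real deployment would replace the stand-in judge with rubric-guided LLM calls.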