Question-Answering Dense Video Events

This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA -- a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.