
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Hannan, Tanveer; Islam, Md Mohaiminul; Gu, Jindong; Seidl, Thomas; Bertasius, Gedas
Abstract

Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R@0.1 on MAD). The code is available at https://github.com/Tanveer81/ReVisionLLM.
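The recursive coarse-to-fine search described in the abstract — first targeting broad segments, then progressively narrowing to exact boundaries — can be illustrated with a minimal sketch. This is not the paper's implementation: `score_segment` is a hypothetical stand-in for the VLM's relevance scoring, and the split factor and stopping threshold are illustrative assumptions.

```python
def score_segment(start, end, query_center):
    # Toy stand-in for a VLM relevance score: 1.0 when the span
    # contains the event, decaying with distance otherwise.
    if start <= query_center < end:
        return 1.0
    mid = (start + end) / 2
    return 1.0 / (1.0 + abs(mid - query_center))

def recursive_ground(start, end, query_center, n_splits=4, min_len=1.0):
    """Recursively zoom into the most relevant sub-segment until the
    span is shorter than `min_len` seconds, mimicking a coarse-to-fine
    human search strategy."""
    if end - start <= min_len:
        return (start, end)
    step = (end - start) / n_splits
    # Score each coarse sub-segment and recurse into the best one.
    best = max(
        ((start + i * step, start + (i + 1) * step) for i in range(n_splits)),
        key=lambda seg: score_segment(seg[0], seg[1], query_center),
    )
    return recursive_ground(best[0], best[1], query_center, n_splits, min_len)

# Locate a toy event at t = 1234.5 s inside an hour-long (3600 s) video.
pred = recursive_ground(0.0, 3600.0, 1234.5)
```

Each recursion level inspects only `n_splits` segments, so an hour of video is narrowed to a sub-second window in logarithmically many scoring calls rather than scoring every frame.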
