Localizing Moments in Long Video Via Multimodal Guidance

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail to tackle this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model, Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data, and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
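
To make the guided grounding idea concrete, the sketch below illustrates one plausible way to combine a Guidance Model with a base grounding model: score short temporal windows for describability, prune the low-scoring ones, and run the grounder only on what remains. This is a minimal illustration, not the authors' released code; the callables `guidance_score` and `ground`, the `keep_ratio` value, and the score-combination rule are all assumptions.

```python
import numpy as np

def guided_grounding(windows, query, guidance_score, ground, keep_ratio=0.3):
    """Prune non-describable windows with a guidance model, then run the
    base grounding model only on the surviving windows.

    windows:        list of (start, end) temporal windows covering the long video
    query:          natural-language query string
    guidance_score: callable returning a describability score per window
                    (a query-agnostic design ignores `query`; a query-dependent
                    design conditions on it)
    ground:         base grounding model returning (span, confidence) for a window
    keep_ratio:     fraction of windows to keep after pruning (assumed value)
    """
    # 1) Score every short window for describability.
    scores = np.array([guidance_score(w, query) for w in windows])

    # 2) Keep only the top-scoring fraction of windows (prune the rest).
    k = max(1, int(len(windows) * keep_ratio))
    kept_idx = np.argsort(scores)[::-1][:k]

    # 3) Run the base grounding model on the kept windows and rescale its
    #    confidences by the guidance scores (one simple fusion choice).
    candidates = []
    for i in kept_idx:
        span, conf = ground(windows[i], query)
        candidates.append((span, conf * scores[i]))

    # 4) Return candidate moments sorted by the combined score.
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

Under this reading, a query-agnostic Guidance Model can score all windows once per video (cheaper), while a query-dependent one rescoring per query trades extra compute for accuracy, matching the efficiency/accuracy balance described above.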