RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video paradigm during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
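
To make the unified retrieval-and-grounding idea concrete, below is a minimal PyTorch sketch, not the authors' implementation, of a single query-conditioned encoder that is read out at two granularities: per-clip retrieval scores and per-frame grounding scores. All names (UnifiedEncoder, frame_head, clip_head, n_frames_per_clip) are hypothetical placeholders, and the actual RG-Encoder additionally uses sparse attention, an attention loss, and contrastive clip sampling, which are omitted here.

import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Encodes query-conditioned frame features once, then reads out both
    frame-level grounding logits and clip-level retrieval logits."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.frame_head = nn.Linear(d_model, 1)  # per-frame grounding logit
        self.clip_head = nn.Linear(d_model, 1)   # per-clip retrieval logit

    def forward(self, frame_feats, query_feats, n_frames_per_clip: int):
        # Concatenate query tokens with frame tokens so frames attend to the text query.
        x = torch.cat([query_feats, frame_feats], dim=1)
        x = self.encoder(x)
        frames = x[:, query_feats.size(1):]                # keep only the frame tokens
        frame_logits = self.frame_head(frames).squeeze(-1)
        # Pool frames into fixed-length clips, then score each clip for retrieval.
        b, t, d = frames.shape
        clips = frames.view(b, t // n_frames_per_clip, n_frames_per_clip, d).mean(2)
        clip_logits = self.clip_head(clips).squeeze(-1)
        return clip_logits, frame_logits

# Toy usage: 2 videos, 64 frames each (8 clips of 8 frames), 10 query tokens.
model = UnifiedEncoder()
clip_logits, frame_logits = model(torch.randn(2, 64, 256), torch.randn(2, 10, 256), 8)
print(clip_logits.shape, frame_logits.shape)  # torch.Size([2, 8]) torch.Size([2, 64])

Because both heads read the same encoded features, gradients from the grounding objective also shape the representation used for retrieval, which is the mutual-optimization property the abstract attributes to the shared RG-Encoder.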