VLG-Net: Video-Language Graph Matching Network for Video Grounding

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs built atop video snippets and query tokens separately, which are used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling, which fuses the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
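
The masked moment attention pooling mentioned above can be illustrated with a minimal PyTorch sketch: attention scores are computed over snippet features, snippets outside the candidate moment are masked out, and the remaining weights fuse the moment's snippets into a single representation. This is a hedged illustration of the general technique, not the authors' implementation; the module name, scoring layer, and feature dimensions below are assumptions.

```python
import torch
import torch.nn as nn


class MaskedMomentAttentionPooling(nn.Module):
    """Pool enriched snippet features inside a candidate moment via masked attention.

    A minimal sketch of masked attention pooling; layer sizes and naming are
    illustrative assumptions, not taken from the paper's code.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-snippet attention logit

    def forward(self, snippets: torch.Tensor, start: int, end: int) -> torch.Tensor:
        # snippets: (T, D) enriched snippet features for one video
        logits = self.score(snippets).squeeze(-1)            # (T,)
        mask = torch.zeros_like(logits, dtype=torch.bool)
        mask[start:end + 1] = True                           # snippets inside the candidate moment
        logits = logits.masked_fill(~mask, float("-inf"))    # exclude snippets outside the moment
        weights = torch.softmax(logits, dim=0)               # attention restricted to the moment
        return weights @ snippets                            # (D,) fused moment representation


# Example: fuse a candidate moment spanning snippets 4..9
feats = torch.randn(32, 256)                 # 32 snippets, 256-d enriched features
pool = MaskedMomentAttentionPooling(256)
moment_repr = pool(feats, start=4, end=9)    # (256,) moment candidate feature
```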