Dense Regression Network for Video Grounding

We address the problem of video grounding from natural language queries. The key challenge in this task is that one training video might only contain a few annotated starting/ending frames that can be used as positive examples for model training. Most conventional approaches directly train a binary classifier on such imbalanced data, thus achieving inferior results. The key idea of this paper is to use the distances between each frame within the ground truth and the starting (ending) frame as dense supervision to improve video grounding accuracy. Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment described by the query. We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results (i.e., the IoU between the predicted location and the ground truth). Experimental results show that our approach significantly outperforms the state of the art on three datasets (i.e., Charades-STA, ActivityNet-Captions, and TACoS).
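To make the dense supervision idea concrete, the following is a minimal sketch (not the authors' implementation; the helper names and the masking convention are illustrative assumptions) of how per-frame regression targets and the temporal IoU used by the IoU head could be computed:

```python
import numpy as np

def dense_regression_targets(num_frames, t_start, t_end):
    """For each frame inside the ground-truth segment [t_start, t_end],
    compute its distance to the starting and to the ending frame.
    Frames outside the segment receive no supervision (masked with -1).
    Hypothetical helper for illustration only."""
    t = np.arange(num_frames)
    inside = (t >= t_start) & (t <= t_end)
    d_start = np.where(inside, t - t_start, -1)  # distance to the starting frame
    d_end = np.where(inside, t_end - t, -1)      # distance to the ending frame
    return d_start, d_end, inside

def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

Every frame inside the ground-truth segment thus yields a training target, rather than only the two annotated boundary frames serving as positives.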