Mining Inter-Video Proposal Relations for Video Object Detection

Recent studies have shown that aggregating context information from proposals in different frames can clearly enhance the performance of video object detection. However, these approaches mainly exploit proposal relations within a single video, while ignoring proposal relations among different videos, which can provide important discriminative cues for recognizing confusing objects. To address this limitation, we propose a novel Inter-Video Proposal Relation module. Based on a concise multi-level triplet selection scheme, this module learns effective object representations by modeling relations of hard proposals among different videos. Moreover, we design a Hierarchical Video Relation Network (HVR-Net) that integrates intra-video and inter-video proposal relations in a hierarchical fashion. This design progressively exploits both intra-video and inter-video contexts to boost video object detection. We evaluate our method on the large-scale video object detection benchmark ImageNet VID, where HVR-Net achieves state-of-the-art results. Code and models will be released.