
Abstract

In this work, we focus on semi-supervised learning for video action detection, which utilizes both labeled and unlabeled data. We propose a simple end-to-end consistency-based approach which effectively utilizes the unlabeled data. Video action detection requires both action class prediction and spatio-temporal localization of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal consistency. The presence of predominant background and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency, and 2) gradient smoothness. Both of these constraints exploit the temporal continuity of action in videos and are found to be effective for utilizing unlabeled videos for action detection. We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets, UCF101-24 and JHMDB-21. In addition, we also show the effectiveness of the proposed approach for video object segmentation on YouTube-VOS, which demonstrates its generalization capability. The proposed approach achieves competitive performance by using merely 20% of annotations on UCF101-24 when compared with recent fully supervised methods. On UCF101-24, it improves the score by +8.9% and +11% at 0.5 f-mAP and v-mAP respectively, compared to the supervised approach.
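To make the idea of a consistency objective on unlabeled clips concrete, the following is a minimal sketch and not the formulation used in this work. It assumes that a model produces class probabilities and per-frame localization maps for an original clip and for an augmented view of it, and it penalizes disagreement between the two views. The function names, the mean-squared-error distance, the loss weights, and the simple frame-to-frame variation term are all illustrative assumptions; in particular, `temporal_variation` is a generic smoothness penalty standing in for the temporal constraints, not the temporal coherency or gradient smoothness terms themselves.

```python
from typing import List, Sequence

# A clip's localization output: T frames, each an H x W map of values in [0, 1].
Frame = List[List[float]]
Clip = List[Frame]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(clip: Clip) -> List[float]:
    """Flatten a T x H x W clip into a single list of pixel values."""
    return [v for frame in clip for row in frame for v in row]


def classification_consistency(p_orig: Sequence[float],
                               p_aug: Sequence[float]) -> float:
    """Disagreement between class probabilities of the two views."""
    return mse(p_orig, p_aug)


def spatiotemporal_consistency(m_orig: Clip, m_aug: Clip) -> float:
    """Disagreement between localization maps of the two views."""
    return mse(flatten(m_orig), flatten(m_aug))


def temporal_variation(clip: Clip) -> float:
    """Mean squared change between consecutive frames (assumed stand-in)."""
    if len(clip) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, nxt in zip(clip, clip[1:]):
        a, b = flatten([prev]), flatten([nxt])
        total += sum((x - y) ** 2 for x, y in zip(a, b))
        count += len(a)
    return total / count


def unlabeled_loss(p_orig: Sequence[float], p_aug: Sequence[float],
                   m_orig: Clip, m_aug: Clip,
                   w_loc: float = 1.0, w_smooth: float = 0.1) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (classification_consistency(p_orig, p_aug)
            + w_loc * spatiotemporal_consistency(m_orig, m_aug)
            + w_smooth * temporal_variation(m_aug))
```

In this sketch, a clip whose localization map is identical in every frame incurs no temporal penalty, while a map that flips between consecutive frames is penalized most. No labels appear anywhere in the loss, which is what allows unlabeled videos to contribute to training.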
