Stable Mean Teacher for Semi-supervised Video Action Detection

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel Error Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To address this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, leading to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. Code and models are publicly available.
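
To make the two ingredients named above more concrete, the following is a minimal NumPy sketch of (a) the standard mean-teacher weight update, in which the teacher tracks an exponential moving average of the student, and (b) one plausible reading of a temporal-consistency constraint built on frame-to-frame pixel differences. The abstract does not give the exact formulation of DoP or of the EoR module, so the function names (`ema_update`, `temporal_difference`, `dop_style_loss`), the decay value, the `(T, H, W)` map layout, and the squared-error form are all assumptions made for illustration, not the released implementation.

```python
import numpy as np


def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an exponential moving average of the student.

    `teacher` and `student` are dicts mapping parameter names to arrays.
    The decay value is an illustrative assumption.
    """
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def temporal_difference(maps):
    """Frame-to-frame difference along the time axis; `maps` has shape (T, H, W)."""
    return maps[1:] - maps[:-1]


def dop_style_loss(student_maps, pseudo_maps):
    """Mean squared error between the temporal differences of two map sequences.

    Hypothetical stand-in for a difference-of-pixels style constraint: it
    penalizes the student when its localization maps change over time
    differently from the teacher's pseudo labels.
    """
    diff = temporal_difference(student_maps) - temporal_difference(pseudo_maps)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pseudo = rng.random((8, 16, 16))                    # teacher pseudo labels
    student = pseudo + 0.1 * rng.standard_normal((8, 16, 16))
    print("temporal consistency loss:", dop_style_loss(student, pseudo))
```

Because this sketch compares only temporal differences, a prediction that is offset from the pseudo label by the same amount in every frame incurs no penalty, whereas frame-to-frame flicker does. In practice such a term would be combined with a per-frame localization loss rather than used alone.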