WOAD: Weakly Supervised Online Action Detection in Untrimmed Videos

Online action detection in untrimmed videos aims to identify an action as it happens, which makes it very important for real-time applications. Previous methods rely on tedious annotations of temporal action boundaries for training, which hinders the scalability of online action detection systems. We propose WOAD, a weakly supervised framework that can be trained using only video-class labels. WOAD contains two jointly trained modules: a temporal proposal generator (TPG) and an online action recognizer (OAR). Supervised by video-class labels, TPG works offline and aims to accurately mine pseudo frame-level labels for OAR. With the supervisory signals from TPG, OAR learns to conduct action detection in an online fashion. Experimental results on THUMOS'14, ActivityNet1.2 and ActivityNet1.3 show that our weakly supervised method substantially outperforms weakly supervised baselines and achieves performance comparable to previous strongly supervised methods. Beyond that, WOAD can flexibly leverage strong supervision when it is available. When strongly supervised, our method obtains state-of-the-art results on both online per-frame action recognition and online detection of action start.
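The pseudo-labeling idea behind TPG and OAR can be illustrated with a minimal sketch. It assumes that an offline module produces per-frame class scores and that pseudo frame-level labels are obtained by thresholding those scores, restricted to the classes named in the video-level label. The function name `pseudo_frame_labels`, the fixed threshold, and the use of -1 for background are assumptions made for illustration, not the procedure used in WOAD.

```python
def pseudo_frame_labels(frame_scores, video_classes, threshold=0.5):
    """Turn per-frame class scores into pseudo frame-level labels.

    frame_scores: list of per-frame lists, one score in [0, 1] per class
                  (hypothetical output of an offline proposal generator).
    video_classes: set of class indices present in the video-level label.
    Returns one label per frame: a class index, or -1 for background.
    """
    labels = []
    for scores in frame_scores:
        # Only classes named in the video-level label are candidates.
        candidates = [(scores[c], c) for c in video_classes]
        best_score, best_class = max(candidates) if candidates else (0.0, -1)
        # Frames whose best candidate score is too low are background.
        labels.append(best_class if best_score >= threshold else -1)
    return labels


# Four frames, three classes; the video-level label says only class 1 occurs.
scores = [
    [0.1, 0.2, 0.0],
    [0.2, 0.9, 0.1],
    [0.1, 0.8, 0.95],
    [0.0, 0.3, 0.1],
]
labels = pseudo_frame_labels(scores, video_classes={1})
print(labels)  # [-1, 1, 1, -1]
```

In this toy example the high score for class 2 in the third frame is ignored because class 2 is not part of the video-level label. Labels of this kind could then serve as per-frame training targets for an online recognizer that sees only past and current frames.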