Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, their advantage over traditional methods is less evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning over the whole action video. Our other contribution is a study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the HMDB51 ($69.4\%$) and UCF101 ($94.2\%$) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
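The sparse temporal sampling and video-level supervision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`sample_snippets`, `segmental_consensus`) are hypothetical, frame indices stand in for actual decoded video frames, and the consensus function is assumed to be a simple average of per-snippet class scores.

```python
# Minimal sketch of TSN-style sparse temporal sampling (illustrative only):
# divide a video into equal-duration segments, draw one snippet per segment,
# and fuse per-snippet predictions into a video-level score.
import random

def sample_snippets(num_frames, num_segments=3, seed=None):
    """Split a video of `num_frames` frames into `num_segments`
    equal-duration segments and draw one frame index per segment."""
    rng = random.Random(seed)
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start, int((k + 1) * seg_len) - 1)
        indices.append(rng.randint(start, end))  # one random frame per segment
    return indices

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into a single video-level prediction
    (an averaging consensus is assumed here for simplicity)."""
    num_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / len(snippet_scores)
            for c in range(num_classes)]

# Example: a 90-frame video, 3 segments -> one sampled index per segment.
print(sample_snippets(90, num_segments=3, seed=0))
```

Because only one snippet per segment is processed, the cost per video stays constant regardless of clip length, while the video-level loss applied to the consensus still supervises the whole action.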