
Abstract

Deep convolutional networks have achieved great success in image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the whole action videos. The learned models can be easily adapted for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for instantiating the TSN framework given limited training samples. Our approach obtains state-of-the-art performance on four challenging action recognition benchmarks: HMDB51 (71.0%), UCF101 (94.9%), THUMOS14 (80.1%), and ActivityNet v1.2 (89.6%). Using the proposed RGB difference for motion models, our method can still achieve competitive accuracy on UCF101 (91.0%) while running at 340 FPS. Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
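
To make the segment-based sampling and average-pooling aggregation more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a video is split into equal-length segments, that one frame index is drawn at random from each segment, and that per-snippet class scores are averaged into a video-level prediction. The function names (`sample_segment_indices`, `average_consensus`, `video_level_scores`) and the `snippet_scorer` callback are hypothetical and introduced only for illustration; the abstract does not specify these details.

```python
import random
from typing import Callable, List, Optional, Sequence


def sample_segment_indices(
    num_frames: int, num_segments: int, rng: Optional[random.Random] = None
) -> List[int]:
    """Split [0, num_frames) into num_segments contiguous segments and
    draw one random frame index from each (assumed sampling scheme)."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need num_frames >= num_segments > 0")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments        # inclusive
        end = (k + 1) * num_frames // num_segments    # exclusive
        indices.append(rng.randrange(start, end))
    return indices


def average_consensus(snippet_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class scores over all snippets (simple average pooling)."""
    num_snippets = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [
        sum(scores[c] for scores in snippet_scores) / num_snippets
        for c in range(num_classes)
    ]


def video_level_scores(
    num_frames: int,
    num_segments: int,
    snippet_scorer: Callable[[int], Sequence[float]],
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Score one sampled snippet per segment, then pool into a video-level score.

    snippet_scorer is a hypothetical stand-in for a ConvNet that maps a
    frame index to class scores.
    """
    indices = sample_segment_indices(num_frames, num_segments, rng)
    return average_consensus([snippet_scorer(i) for i in indices])
```

Because only one snippet per segment is scored, the cost per video in this sketch depends on the number of segments rather than on the video length, which is one plausible reading of how the whole video can be covered efficiently.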
