Segmental Spatiotemporal CNNs for Fine-grained Action Segmentation

Joint segmentation and classification of fine-grained actions is importantfor applications of human-robot interaction, video surveillance, and humanskill evaluation. However, despite substantial recent progress in large-scaleaction classification, the performance of state-of-the-art fine-grained actionrecognition approaches remains low. We propose a model for action segmentationwhich combines low-level spatiotemporal features with a high-level segmentalclassifier. Our spatiotemporal CNN is comprised of a spatial component thatuses convolutional filters to capture information about objects and theirrelationships, and a temporal component that uses large 1D convolutionalfilters to capture information about how object relationships change acrosstime. These features are used in tandem with a semi-Markov model that modelstransitions from one action to another. We introduce an efficient constrainedsegmental inference algorithm for this model that is orders of magnitude fasterthan the current approach. We highlight the effectiveness of our SegmentalSpatiotemporal CNN on cooking and surgical action datasets for which we observesubstantially improved performance relative to recent baseline methods.