STAR-Net: Action Recognition using Spatio-Temporal Activation Reprojection

While depth cameras and inertial sensors have been frequently leveraged for human action recognition, these sensing modalities are impractical in many scenarios where cost or environmental constraints prohibit their use. As such, there has been recent interest in human action recognition using low-cost, readily-available RGB cameras via deep convolutional neural networks. However, many of the deep convolutional neural networks proposed for action recognition thus far have relied heavily on learning global appearance cues directly from imaging data, resulting in highly complex network architectures that are computationally expensive and difficult to train. Motivated to reduce network complexity and achieve higher performance, we introduce the concept of spatio-temporal activation reprojection (STAR). More specifically, we reproject the spatio-temporal activations generated by human pose estimation layers in space and time using a stack of 3D convolutions. Experimental results on UTD-MHAD and J-HMDB demonstrate that an end-to-end architecture based on the proposed STAR framework (which we nickname STAR-Net) is proficient in single-environment and small-scale applications. On UTD-MHAD, STAR-Net outperforms several methods using richer data modalities such as depth and inertial sensors.
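To make the core idea concrete, the following is a minimal sketch of a STAR-style classification head: per-joint activation maps from a pose estimator, stacked along the temporal axis, are processed by a stack of 3D convolutions and pooled into action class scores. The module name `STARHead`, the layer widths, and the input dimensions are illustrative assumptions, not the paper's exact STAR-Net configuration.

```python
import torch
import torch.nn as nn

class STARHead(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): classify
    actions from a clip of pose-estimation activation maps using a stack
    of 3D convolutions over space and time."""

    def __init__(self, num_joints=16, num_classes=27):
        super().__init__()
        self.conv3d = nn.Sequential(
            # Input: (batch, joints, frames, height, width) activation maps
            nn.Conv3d(num_joints, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # collapse the spatial and temporal axes
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, heatmaps):
        # heatmaps: per-joint confidence maps produced by a pose estimator,
        # stacked along the temporal axis to form a spatio-temporal volume
        x = self.conv3d(heatmaps)
        return self.fc(x.flatten(1))

# Example: 16-joint heatmaps for an 8-frame clip at 64x64 resolution;
# 27 output classes matches the number of actions in UTD-MHAD
net = STARHead()
clip = torch.randn(2, 16, 8, 64, 64)
logits = net(clip)  # shape (2, 27): per-clip action class scores
```

Operating on pose activation maps rather than raw RGB frames keeps the 3D convolutional stack small, which reflects the stated motivation of reducing network complexity relative to appearance-based architectures.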