Deep set conditioned latent representations for action recognition

Singh, Akash ; De Schepper, Tom ; Mets, Kevin ; Hellinckx, Peter ; Oramas, Jose ; Latre, Steven

Abstract

In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANNs) still struggle to classify them. In the real world, atomic actions often connect temporally to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two-stream relational network that employs a latent representation of state and a visual representation for reasoning over events and actions. The model learns to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition over an I3D-NL baseline on the CATER dataset.
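To illustrate what an "unordered set-based latent representation" can mean in practice, the following is a minimal sketch of the general deep-set idea: embed each set element independently, then pool with a permutation-invariant operation such as a sum. It is not the SCI3D architecture, whose details the abstract does not give. The function names (`phi`, `deep_set_encode`), the tanh embedding, and the toy weights and object vectors are all assumptions made for illustration.

```python
import math
from itertools import permutations


def phi(x, weights):
    """Hypothetical per-element embedding: tanh of a linear map."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def deep_set_encode(elements, weights):
    """Sum-pool the per-element embeddings into one fixed-size vector.

    Summation does not depend on element order, so the result is the
    same (up to floating-point rounding) for any ordering of the set.
    """
    pooled = [0.0] * len(weights)
    for x in elements:
        embedded = phi(x, weights)
        pooled = [p + e for p, e in zip(pooled, embedded)]
    return pooled


# Toy example: three "object state" vectors (assumed values, not CATER data).
W = [[0.5, -1.0], [1.0, 0.25], [-0.3, 0.8]]
objects = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

reference = deep_set_encode(objects, W)
for ordering in permutations(objects):
    encoded = deep_set_encode(list(ordering), W)
    # Every ordering yields the same latent vector within tolerance.
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(reference, encoded))
```

In a two-stream setting such as the one the abstract describes, a pooled vector of this kind could plausibly be used to condition a visual stream, but how SCI3D combines the two representations is not stated here.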