Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition

In this paper we propose an end-to-end trainable deep neural network model for egocentric activity recognition. Our model is built on the observation that egocentric activities are highly characterized by the objects and their locations in the video. Based on this, we develop a spatial attention mechanism that enables the network to attend to regions containing objects that are correlated with the activity under consideration. We learn highly specialized attention maps for each frame using class-specific activations from a CNN pre-trained for generic image recognition, and use them for spatio-temporal encoding of the video with a convolutional LSTM. Our model is trained in a weakly supervised setting using only raw video-level activity-class labels. Nonetheless, on standard egocentric activity benchmarks our model surpasses the current best-performing method, which relies on strong supervision from hand segmentation and object locations during training, by up to 6 percentage points in recognition accuracy. We visually analyze the attention maps generated by the network, revealing that it successfully identifies the relevant objects present in the video frames, which may explain the strong recognition performance. We also provide an extensive ablation analysis of our design choices.
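To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the overall idea: per-frame CAM-style spatial attention derived from the class-specific weights of a pre-trained recognition head, used to re-weight CNN features before a convolutional LSTM encodes the sequence for video-level classification. All class and parameter names (`AttentionConvLSTM`, `fc_cam`, `hid_ch`, etc.) are hypothetical, and the exact architecture in the paper may differ; this is an illustration under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: gates are computed with
    convolutions so the hidden state keeps a spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c


class AttentionConvLSTM(nn.Module):
    """Sketch: CAM-style spatial attention over per-frame CNN features,
    followed by ConvLSTM temporal encoding and video-level classification."""
    def __init__(self, backbone, feat_ch, n_pretrain_classes, n_activities, hid_ch=256):
        super().__init__()
        self.backbone = backbone  # pre-trained CNN trunk -> (B, feat_ch, h, w)
        # Classification weights of the pre-trained recognition task,
        # reused as class-specific activation templates (CAM).
        self.fc_cam = nn.Linear(feat_ch, n_pretrain_classes)
        self.convlstm = ConvLSTMCell(feat_ch, hid_ch)
        self.classifier = nn.Linear(hid_ch, n_activities)

    def attention(self, feat):
        # For each frame, pick the top-scoring pre-training class and use
        # its class activation map as a normalized spatial attention map.
        pooled = feat.mean(dim=(2, 3))                     # (B, C)
        top = self.fc_cam(pooled).argmax(dim=1)            # (B,)
        w = self.fc_cam.weight[top]                        # (B, C)
        cam = torch.einsum('bc,bchw->bhw', w, feat)        # (B, H, W)
        a = torch.softmax(cam.flatten(1), dim=1).view_as(cam)
        return a.unsqueeze(1)                              # (B, 1, H, W)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -> video-level activity logits,
        # so only a video-level label is needed for training (weak supervision).
        B, T = frames.shape[:2]
        h = c = None
        for t in range(T):
            feat = self.backbone(frames[:, t])
            if h is None:
                h = feat.new_zeros(B, self.convlstm.hid_ch, *feat.shape[2:])
                c = h.clone()
            h, c = self.convlstm(feat * self.attention(feat), (h, c))
        return self.classifier(h.mean(dim=(2, 3)))
```

Note the design choice this illustrates: the attention maps come for free from a network pre-trained for generic image recognition, so no object locations or hand masks are ever annotated, matching the weakly supervised setting described in the abstract.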