Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Egocentric video recognition is a natural testbed for diverse interaction reasoning. Due to the large action vocabulary in egocentric video datasets, recent studies usually adopt a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification. However, the correlation between the verb and noun branches has been largely ignored. Besides, the two branches fail to exploit local features due to the absence of a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention framework leveraging Privileged information (SAP) for egocentric video recognition. Finer position-aware object detection features can facilitate the understanding of the actor's interaction with the object. We introduce these features into action recognition and regard them as privileged information. Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information. This communication process not only injects local details into global features but also exploits implicit guidance about the spatio-temporal position of an ongoing action. We introduce a novel symbiotic attention (SA) mechanism to enable effective communication. It first normalizes the detection-guided features on one branch to underline the action-relevant information from the other branch. SA adaptively enhances the interactions among the three sources. To further catalyze this communication, spatial relations are uncovered for the selection of the most action-relevant information, identifying the most valuable and discriminative feature for classification. We validate the effectiveness of SAP quantitatively and qualitatively. Notably, it achieves state-of-the-art results on two large-scale egocentric video datasets.
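To make the described communication concrete, the following is a minimal PyTorch sketch of one plausible reading of the cross-branch mechanism: each branch's global feature is enriched with detection-guided local features, and the other branch's feature is used as a query to select the most action-relevant one. All module and variable names (e.g., `SymbioticAttentionSketch`, `det_feats`) are hypothetical and the shapes are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SymbioticAttentionSketch(nn.Module):
    """Sketch of cross-branch attention with privileged detection features."""

    def __init__(self, dim: int):
        super().__init__()
        self.det_proj = nn.Linear(dim, dim)    # project detection features
        self.norm = nn.LayerNorm(dim)          # normalize detection-guided features
        self.query_proj = nn.Linear(dim, dim)  # query from the other branch

    def forward(self, own_feat, other_feat, det_feats):
        # own_feat, other_feat: (B, D) global branch features;
        # det_feats: (B, N, D) privileged per-object detection features.
        # Inject local detection details into this branch's global feature.
        det_guided = self.norm(own_feat.unsqueeze(1) + self.det_proj(det_feats))
        # The *other* branch provides the query that selects the most
        # action-relevant detection-guided feature.
        query = self.query_proj(other_feat).unsqueeze(-1)       # (B, D, 1)
        scores = torch.bmm(det_guided, query).squeeze(-1)       # (B, N)
        weights = F.softmax(scores, dim=-1)                     # (B, N)
        selected = torch.bmm(weights.unsqueeze(1), det_guided)  # (B, 1, D)
        # Enhance the branch's global feature with the selected local cue.
        return own_feat + selected.squeeze(1)


if __name__ == "__main__":
    B, N, D = 2, 5, 256
    verb_feat = torch.randn(B, D)     # global feature from the verb branch
    noun_feat = torch.randn(B, D)     # global feature from the noun branch
    det_feats = torch.randn(B, N, D)  # privileged detection features

    sa = SymbioticAttentionSketch(D)
    verb_out = sa(verb_feat, noun_feat, det_feats)  # noun branch guides verb
    noun_out = sa(noun_feat, verb_feat, det_feats)  # verb branch guides noun
    print(verb_out.shape, noun_out.shape)
```

The design choice illustrated here is the "symbiotic" direction of guidance: each branch is refined by attention weights computed from the other branch's feature, so verb and noun predictions share evidence through the privileged detection features rather than being classified independently.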