Integrating Human Gaze into Attention for Egocentric Activity Recognition

It is well known that human gaze carries significant information about visual attention. However, there are three main difficulties in incorporating gaze data into an attention mechanism of deep neural networks: 1) gaze fixation points are likely to contain measurement errors due to blinking and rapid eye movements; 2) it is unclear when, and to what extent, gaze data correlates with visual attention; and 3) gaze data is not available in many real-world situations. In this work, we introduce an effective probabilistic approach to integrating human gaze into spatiotemporal attention for egocentric activity recognition. Specifically, we represent the locations of gaze fixation points as structured discrete latent variables to model their uncertainty. In addition, we model the distribution of gaze fixations using a variational method. The gaze distribution is learned during training, so ground-truth gaze annotations are no longer needed at test time: gaze locations are instead predicted from the learned distribution. The predicted gaze locations provide informative attentional cues that improve recognition performance. Our method outperforms all previous state-of-the-art approaches on EGTEA, a large-scale egocentric activity recognition dataset that includes gaze measurements. We also perform an ablation study and qualitative analysis to demonstrate the effectiveness of our attention mechanism.
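To make the mechanism concrete, the following is a minimal NumPy sketch of the core idea: a network head predicts a distribution over discrete spatial gaze locations, spatial features are pooled under that distribution as a soft attention, and at training time a KL term can pull the predicted distribution toward measured gaze. This is an illustrative simplification, not the paper's implementation; all function names are hypothetical, and the full variational objective and temporal modeling are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat array."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def gaze_attention_pool(features, gaze_logits):
    """Pool spatial features under a predicted gaze distribution.

    features: (H, W, C) feature map; gaze_logits: (H, W) unnormalized
    scores for each discrete gaze location. Returns the expectation of
    the features under p(gaze) -- a soft relaxation of sampling a single
    discrete fixation point -- plus the distribution itself.
    """
    H, W, _ = features.shape
    attn = softmax(gaze_logits.ravel()).reshape(H, W)  # p(gaze location)
    pooled = (features * attn[..., None]).sum(axis=(0, 1))
    return pooled, attn

def kl_to_measured_gaze(attn, measured, eps=1e-8):
    """Training-time regularizer: KL(measured || predicted).

    Encourages the learned gaze distribution to match measured fixations,
    so it can stand in for annotations at test time (hypothetical form).
    """
    return float((measured * np.log((measured + eps) / (attn + eps))).sum())

# Usage: uniform logits give uniform attention, so pooling reduces
# to a spatial mean of the feature map.
feats = np.arange(12, dtype=float).reshape(2, 2, 3)
vec, attn = gaze_attention_pool(feats, np.zeros((2, 2)))
```

In practice a discrete relaxation such as Gumbel-softmax is a common way to backpropagate through a categorical gaze variable; the soft expectation above is the zero-temperature-free analogue of that choice.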