What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention

Egocentric action anticipation consists in understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past, and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-Kitchens dataset, which includes more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. Our method is ranked first in the public leaderboard of the EPIC-Kitchens egocentric action anticipation challenge 2019. Please see our web pages for code and examples: http://iplab.dmi.unict.it/rulstm - https://github.com/fpv-iplab/rulstm.
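To make the two mechanisms described above concrete, below is a minimal PyTorch sketch of one rolling-unrolling branch and of a MATT-style fusion. All class names, layer sizes, and the number of unrolling steps are illustrative assumptions rather than the paper's exact implementation; the official code is available at the repository linked above.

```python
# A minimal sketch, assuming pre-extracted per-frame features; names,
# layer sizes, and the number of anticipation steps are hypothetical.
import torch
import torch.nn as nn

class RollingUnrollingBranch(nn.Module):
    """One modality branch: a 'rolling' LSTM summarizes the observed past;
    an 'unrolling' LSTM cell is then iterated to produce anticipated-action
    scores at several future time steps."""
    def __init__(self, feat_dim, hidden_dim, num_classes, unroll_steps=8):
        super().__init__()
        self.rolling = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.unrolling = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.unroll_steps = unroll_steps

    def forward(self, feats):
        # feats: (batch, time, feat_dim) for one modality (RGB, flow, or objects)
        _, (h, c) = self.rolling(feats)
        h, c = h[-1], c[-1]                # last-layer hidden/cell state
        last = feats[:, -1]                # reuse last observed feature while unrolling
        scores = []
        for _ in range(self.unroll_steps):
            h, c = self.unrolling(last, (h, c))
            scores.append(self.classifier(h))
        # one prediction per anticipation step, plus the final state for fusion
        return torch.stack(scores, dim=1), h

class ModalityAttention(nn.Module):
    """MATT-style late fusion: a small network maps the concatenated
    per-modality hidden states to one weight per modality; the
    softmax-normalized weights combine the modality-specific scores."""
    def __init__(self, hidden_dim, num_modalities=3):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(hidden_dim * num_modalities, 256),
            nn.ReLU(),
            nn.Linear(256, num_modalities),
        )

    def forward(self, hiddens, scores):
        # hiddens: list of (batch, hidden_dim); scores: list of (batch, classes)
        w = torch.softmax(self.attn(torch.cat(hiddens, dim=-1)), dim=-1)
        return sum(w[:, i:i + 1] * s for i, s in enumerate(scores))
```

In this sketch the fusion weights are recomputed per example from the branches' internal states, so the network can adaptively emphasize or discount a modality depending on the input, which is the adaptive weighting behavior the abstract refers to.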