EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

Kazakos, Evangelos ; Nagrani, Arsha ; Zisserman, Andrew ; Damen, Dima
Abstract

We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end, outperforming individual modalities as well as late-fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset, EPIC-Kitchens, on all metrics using the public leaderboard.
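
The ordering described above (fuse the modalities first, then aggregate over time, reusing the same fusion weights for every sampled segment) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The feature sizes, the single linear-plus-ReLU fusion layer, the averaging step, and the names `fuse_then_aggregate`, `w_fuse` and `w_cls` are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def fuse_then_aggregate(rgb, flow, audio, w_fuse, w_cls):
    """Fuse per-segment modality features, then aggregate over segments.

    rgb, flow, audio: arrays of shape (segments, dim), one row per sparsely
        sampled segment (hypothetical pre-extracted features).
    w_fuse: (3 * dim, hidden) fusion weights, shared across all segments.
    w_cls: (hidden, classes) classifier weights, also shared.
    """
    # Mid-level fusion: concatenate the three modalities within each segment.
    fused_in = np.concatenate([rgb, flow, audio], axis=1)
    # The same fusion layer is applied to every segment (shared over time).
    hidden = np.maximum(fused_in @ w_fuse, 0.0)
    segment_scores = hidden @ w_cls
    # Temporal aggregation happens only after fusion (assumed here: a mean).
    return segment_scores.mean(axis=0)


rng = np.random.default_rng(0)
segments, dim, hidden_dim, classes = 3, 8, 16, 5
rgb, flow, audio = (rng.standard_normal((segments, dim)) for _ in range(3))
w_fuse = rng.standard_normal((3 * dim, hidden_dim))
w_cls = rng.standard_normal((hidden_dim, classes))

video_scores = fuse_then_aggregate(rgb, flow, audio, w_fuse, w_cls)
```

A late-fusion baseline would instead score and average each modality separately over time and combine the per-modality predictions only at the end, so no layer would see RGB, Flow and Audio features from the same segment together.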
