Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem in a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
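
To make the input/output structure described above concrete, the following is a minimal, hypothetical PyTorch sketch: multi-channel microphone spectrograms and an egocentric video frame go in, and a spherical voice-activity map (azimuth x elevation, covering directions beyond the camera's field of view) plus a wearer-voice logit come out. The layer choices, tensor shapes, microphone count, and grid resolution are illustrative assumptions, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn


class AudioVisualSpeakerLocalizer(nn.Module):
    """Illustrative two-branch model: multi-channel audio + video -> spherical map + wearer VAD."""

    def __init__(self, n_mics=7, n_az=90, n_el=18):
        super().__init__()
        # Audio branch: each microphone channel's spectrogram is one input channel.
        self.audio_net = nn.Sequential(
            nn.Conv2d(n_mics, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Visual branch: a small CNN over the egocentric RGB frame.
        self.video_net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        fused = 64 + 64
        # Head 1: voice-activity logits over the full sphere (all possible directions).
        self.sphere_head = nn.Linear(fused, n_az * n_el)
        # Head 2: the device wearer's own voice activity.
        self.wearer_head = nn.Linear(fused, 1)
        self.n_az, self.n_el = n_az, n_el

    def forward(self, audio_spec, frame):
        # audio_spec: (B, n_mics, freq, time); frame: (B, 3, H, W)
        a = self.audio_net(audio_spec)
        v = self.video_net(frame)
        f = torch.cat([a, v], dim=1)
        sphere_logits = self.sphere_head(f).view(-1, self.n_el, self.n_az)
        wearer_logit = self.wearer_head(f).squeeze(1)
        return sphere_logits, wearer_logit


# Example forward pass with dummy tensors.
model = AudioVisualSpeakerLocalizer()
audio = torch.randn(2, 7, 128, 96)    # batch of 7-channel spectrograms (freq x time)
frames = torch.randn(2, 3, 224, 224)  # batch of egocentric RGB frames
sphere, wearer = model(audio, frames)
print(sphere.shape, wearer.shape)     # torch.Size([2, 18, 90]) torch.Size([2])
```

The two output heads mirror the two tasks stated in the abstract: localizing active speakers on the sphere and detecting the wearer's own voice activity; how the actual model fuses the modalities and supervises each head is detailed in the paper itself.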