MAAS: Multi-modal Assignation for Active Speaker Detection

Active speaker detection requires a solid integration of multi-modal cues. While individual modalities can approximate a solution, accurate predictions can only be achieved by explicitly fusing the audio and visual features and modeling their temporal progression. Despite its inherently multi-modal nature, current methods still focus on modeling and fusing short-term audio-visual features for individual speakers, often at the frame level. In this paper we present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem, and provides a straightforward strategy where independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 88.8\%.
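To make the assignment idea in the abstract concrete, the following is a minimal, illustrative sketch of a per-frame graph linking one speech-event node to several candidate speaker nodes. It is not the MAAS architecture: the use of networkx, the cosine-similarity edge weights (a stand-in for learned affinities), and the argmax assignment rule are all assumptions made purely for illustration.

```python
# Illustrative sketch only: a per-frame audio-visual assignment graph.
# Node features, edge weights, and the assignment rule are placeholders.
import numpy as np
import networkx as nx

def build_frame_graph(audio_feat, face_feats):
    """Connect a single speech-event node to every candidate speaker node."""
    g = nx.Graph()
    g.add_node("audio", feat=audio_feat)
    for i, f in enumerate(face_feats):
        g.add_node(f"speaker_{i}", feat=f)
        # Edge weight: cosine similarity between audio and visual features,
        # standing in for the affinity a learned model would produce.
        w = float(np.dot(audio_feat, f) /
                  (np.linalg.norm(audio_feat) * np.linalg.norm(f) + 1e-8))
        g.add_edge("audio", f"speaker_{i}", weight=w)
    return g

def assign_speech_event(g):
    """Assign the speech event to the candidate with the strongest edge."""
    edges = g.edges("audio", data="weight")
    return max(edges, key=lambda e: e[2])[1]

# Toy example: three candidate speakers with 128-d features.
rng = np.random.default_rng(0)
audio = rng.normal(size=128)
faces = [rng.normal(size=128) for _ in range(3)]
print(assign_speech_event(build_frame_graph(audio, faces)))
```

A temporal extension of this idea, as described in the abstract, would connect such per-frame graphs across neighboring frames so that the assignment also reflects the temporal progression of the audio-visual features.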