UniCon: Unified Context Network for Robust Active Speaker Detection

We introduce a new, efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios with low-resolution faces, multiple candidates, etc. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in a unified process for robust and reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state-of-the-art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, our UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for the first time on this challenging dataset at the time of submission. Project website: https://unicon-asd.github.io/.
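To make the three kinds of context concrete, below is a minimal PyTorch sketch of how spatial, relational, and temporal context might be combined into per-candidate speaking scores. Every name and design choice here (the `UniConSketch` module, multi-head attention for relational context, a GRU for temporal context, the feature dimensions) is an illustrative assumption, not the paper's actual architecture.

```python
# Minimal sketch of the three contexts described above. All module names,
# dimensions, and design choices are illustrative assumptions, not the
# paper's actual UniCon architecture.
import torch
import torch.nn as nn


class UniConSketch(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Spatial context: embed each candidate's face position and scale
        # (normalized box center x/y, width, height) into the feature space.
        self.spatial_embed = nn.Linear(4, feat_dim)
        # Relational context: attention across candidates in the same scene,
        # so audio-visual affinities can be contrasted with one another.
        self.relational = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                batch_first=True)
        # Temporal context: a GRU aggregates long-term information and
        # smooths out local (per-frame) uncertainties.
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Joint audio-visual speaking score per candidate per frame.
        self.classifier = nn.Linear(2 * feat_dim, 1)

    def forward(self, face_feats, audio_feats, boxes):
        # face_feats: (B, T, N, D) visual features for N candidate face tracks
        # audio_feats: (B, T, D)   shared audio features for the scene
        # boxes:      (B, T, N, 4) normalized position/scale of each face
        B, T, N, D = face_feats.shape
        x = face_feats + self.spatial_embed(boxes)     # add spatial context
        x = x.reshape(B * T, N, D)
        x, _ = self.relational(x, x, x)                # across-candidate attention
        x = x.reshape(B, T, N, D).permute(0, 2, 1, 3)  # (B, N, T, D)
        x, _ = self.temporal(x.reshape(B * N, T, D))   # long-term smoothing
        x = x.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to (B, T, N, D)
        a = audio_feats.unsqueeze(2).expand(-1, -1, N, -1)
        return self.classifier(torch.cat([x, a], dim=-1)).squeeze(-1)


# Usage: scores for 2 scenes, 16 frames, 3 candidate speakers each.
model = UniConSketch()
scores = model(torch.randn(2, 16, 3, 128), torch.randn(2, 16, 128),
               torch.rand(2, 16, 3, 4))
print(scores.shape)  # torch.Size([2, 16, 3])
```

The key point this sketch illustrates is the unified process the abstract describes: all candidates in a scene are scored jointly, with cross-candidate attention applied before temporal aggregation, rather than classifying each pre-cropped face track in isolation.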