Active Speakers in Context

Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies verifying that this result is a direct consequence of our long-term multi-speaker analysis.
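To make the two ingredients named in the abstract concrete, below is a minimal sketch of how a "structured ensemble" of per-speaker audio-visual features might be consumed by pairwise (across speakers) and temporal (across time) relation modules. This is written in PyTorch; the tensor layout, the layer choices (multi-head attention over speakers, a GRU over time), and all sizes (`dim`, `heads`, the toy shapes) are our own illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ActiveSpeakerContextSketch(nn.Module):
    """Toy sketch of a multi-speaker, long-term context model.

    Input: a structured ensemble of per-speaker audio-visual embeddings
    of shape (batch, T, S, D) -- T timesteps, S candidate speakers,
    D-dim fused audio-visual features. All sizes are illustrative.
    """

    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Pairwise relations: self-attention across the S candidate speakers.
        self.pairwise = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal relations: a recurrent pass over the T timesteps.
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        # Per-speaker speaking / not-speaking score.
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        b, t, s, d = x.shape
        # Speaker axis: attend among the S speakers at each timestep.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.pairwise(xs, xs, xs)
        x = xs.reshape(b, t, s, d)
        # Time axis: run each speaker's track through the GRU.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal(xt)
        x = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return self.head(x).squeeze(-1)  # (batch, T, S) logits

model = ActiveSpeakerContextSketch()
ensemble = torch.randn(2, 16, 3, 128)  # 2 clips, 16 steps, 3 speakers
print(model(ensemble).shape)           # torch.Size([2, 16, 3])
```

The design point the sketch illustrates is the one the abstract argues for: scores for each candidate are computed only after features have been exchanged across all speakers and propagated over a long temporal window, rather than from one speaker's short-term clip in isolation.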