Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded as a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve active speaker detection performance owing to their explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL
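To make the graph construction described above concrete, the following is a minimal sketch, assuming fused audiovisual features per detected face and using PyTorch Geometric; the function name, arguments, and the temporal window parameter tau are illustrative and not the authors' actual API. It encodes the two edge types from the abstract: spatial edges between different people in the same frame, and temporal edges linking the same person across nearby frames.

```python
# Hypothetical sketch of SPELL-style spatial-temporal graph construction.
import torch
from torch_geometric.data import Data

def build_spatial_temporal_graph(features, frame_ids, person_ids, tau=1):
    """features: (N, D) per-face audiovisual features.
    frame_ids, person_ids: length-N integer lists identifying each node."""
    N = len(frame_ids)
    src, dst = [], []
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            same_frame = frame_ids[i] == frame_ids[j]
            same_person = person_ids[i] == person_ids[j]
            close_in_time = abs(frame_ids[i] - frame_ids[j]) <= tau
            # Spatial edge: different people co-occurring in one frame.
            # Temporal edge: the same person within a bounded time window,
            # which keeps the graph sparse instead of fully connected.
            if (same_frame and not same_person) or \
               (same_person and close_in_time and not same_frame):
                src.append(i)
                dst.append(j)
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return Data(x=features, edge_index=edge_index)
```

With the graph built this way, ASD reduces to binary node classification: a few message-passing layers followed by a per-node head predict whether each face node is speaking, and long-range temporal context propagates through chains of sparse temporal edges rather than dense all-pairs connections.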