A Light Weight Model for Active Speaker Detection

Active speaker detection is a challenging task in audio-visual scenario understanding, which aims to detect who is speaking in scenarios with one or more speakers. This task has received extensive attention because it is crucial in applications such as speaker diarization, speaker tracking, and automatic video editing. Existing studies try to improve performance by inputting information from multiple candidates and designing complex models. Although these methods achieve outstanding performance, their high memory and computational power consumption makes them difficult to apply in resource-limited scenarios. Therefore, we construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while its resource costs are significantly lower than those of the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x) and FLOPs (0.6G vs. 2.6G, about 4x). In addition, our framework also performs well on the Columbia dataset, showing good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
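To see why splitting convolutions reduces model size, a rough parameter count is helpful: a full 3D convolution can be factorized into a spatial (1 x k x k) convolution followed by a temporal (k x 1 x 1) convolution. The sketch below compares the two; the channel sizes and kernel size are hypothetical illustration values, not the configuration from this paper.

```python
def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a full 3D convolution with a k x k x k kernel (bias ignored)."""
    return c_in * c_out * k ** 3

def split_conv_params(c_in: int, c_mid: int, c_out: int, k: int) -> int:
    """Weights when the 3D kernel is split into a spatial 1 x k x k convolution
    followed by a temporal k x 1 x 1 convolution (bias ignored)."""
    spatial = c_in * c_mid * k * k   # 2D part over height and width
    temporal = c_mid * c_out * k     # 1D part over time
    return spatial + temporal

# Hypothetical example: 64 -> 64 channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, 3)            # 64 * 64 * 27 = 110592
split = split_conv_params(64, 64, 64, 3)   # 36864 + 12288 = 49152
print(full, split, full / split)           # the split uses 2.25x fewer weights here
```

The exact savings depend on channel widths and kernel sizes, but the split always replaces a cubic k**3 term with a k**2 + k term, which is where much of the parameter and FLOP reduction comes from.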