Multi-Task Learning for Audio Visual Active Speaker Detection

This report describes the approach underlying our submission to the active speaker detection task (task B-2) of the ActivityNet Challenge 2019. We introduce a new audio-visual model which builds upon a 3D-ResNet18 visual model pretrained for lipreading and a VGG-M acoustic model pretrained for audio-to-video synchronization. The model is trained with two losses in a multi-task learning fashion: a contrastive loss to enforce matching between audio and video features for active speakers, and a regular cross-entropy loss to obtain speaker/non-speaker labels. This model obtains 84.0% mAP on the validation set of AVA-ActiveSpeaker. Experimental results showcase the ability of the pretrained embeddings to transfer across tasks and data formats, as well as the advantage of the proposed multi-task learning strategy.
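To make the two-loss setup concrete, the following is a minimal sketch of how the contrastive objective on the audio/video embeddings and the speaker/non-speaker cross-entropy could be combined; the function name, the margin value, and the loss weighting (`ce_weight`) are illustrative assumptions, not values specified in the report.

```python
import torch
import torch.nn.functional as F

def multitask_loss(audio_emb, video_emb, logits, labels, margin=1.0, ce_weight=1.0):
    """Hypothetical combination of a contrastive audio-visual matching loss
    with a speaker / non-speaker cross-entropy loss."""
    # Euclidean distance between audio and video embeddings of each clip.
    dist = F.pairwise_distance(audio_emb, video_emb)
    # Contrastive term: pull embeddings together for active speakers (label 1),
    # push them at least `margin` apart for non-speakers (label 0).
    pos = labels.float() * dist.pow(2)
    neg = (1 - labels.float()) * F.relu(margin - dist).pow(2)
    contrastive = (pos + neg).mean()
    # Standard cross-entropy on the speaker / non-speaker logits.
    ce = F.cross_entropy(logits, labels)
    return contrastive + ce_weight * ce
```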