Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos

We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL on videos. Since a dataset to study this task is not currently available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, derived from an existing large-scale audio event dataset. We empirically show that performance improves on both zero-shot classification and retrieval when the audio modality is added, using multimodal extensions of embedding learning methods. We also propose a novel method to predict the `dominant' modality using a jointly learned modality attention network. The attention is learned in a semi-supervised setting and thus requires no additional explicit labelling of the modalities. We provide qualitative validation of the modality-specific attention, which also generalizes successfully to unseen test classes.
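
To make the described architecture concrete, the following is a minimal PyTorch sketch of a joint audio-visual embedding with a modality attention network, as the abstract outlines. The layer sizes, feature dimensions, and the specific fusion scheme (softmax-weighted sum of per-modality embeddings compared against class-label embeddings) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class ModalityAttentionEmbedding(nn.Module):
    """Sketch: project audio and visual features into a shared embedding space
    and fuse them with a learned attention over the two modalities.
    Dimensions (audio_dim, visual_dim, embed_dim) are illustrative assumptions."""

    def __init__(self, audio_dim=128, visual_dim=1024, embed_dim=300):
        super().__init__()
        # Modality-specific projections into a shared (class-label) embedding space.
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        # Attention network predicting a weight per modality ("dominant" modality).
        self.attention = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, audio_feat, visual_feat):
        # Per-modality embeddings.
        e_a = self.audio_proj(audio_feat)
        e_v = self.visual_proj(visual_feat)
        # Softmax attention over the two modalities, computed from raw features.
        w = torch.softmax(
            self.attention(torch.cat([audio_feat, visual_feat], dim=-1)), dim=-1)
        # Attention-weighted fusion of the modality embeddings.
        return w[..., 0:1] * e_a + w[..., 1:2] * e_v


# Usage: compare fused video embeddings with class-label embeddings
# (e.g., word vectors of class names) for zero-shot classification/retrieval.
model = ModalityAttentionEmbedding()
video_emb = model(torch.randn(4, 128), torch.randn(4, 1024))  # (batch, embed_dim)
class_emb = torch.randn(33, 300)                              # label embeddings, 33 classes
scores = video_emb @ class_emb.t()                            # similarity scores
pred = scores.argmax(dim=-1)                                  # predicted (unseen) class
```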