
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding
  of Sound and Language
Abstract

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav
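The abstract attributes DenseAV's localization ability to a multi-head aggregation operator that scores a full volume of dense audio-visual similarities, rather than comparing a single global embedding per modality. The sketch below illustrates one plausible form of such an operator in PyTorch; the function name, tensor shapes, and the specific pooling choices (max over image locations, mean over time and heads) are illustrative assumptions for this sketch, not the paper's exact operator.

    # Hypothetical sketch of a multi-head dense audio-visual similarity
    # aggregator in the spirit of DenseAV's contrastive objective.
    # Shapes and pooling choices are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def multihead_dense_similarity(audio_feats, visual_feats, num_heads):
        """
        audio_feats:  (B, C, T)     dense per-timestep audio features
        visual_feats: (B, C, H, W)  dense per-patch visual features
        Returns a (B, B) clip-to-image similarity matrix for a
        contrastive (InfoNCE-style) loss.
        """
        B, C, T = audio_feats.shape
        _, _, H, W = visual_feats.shape
        assert C % num_heads == 0
        d = C // num_heads

        # Split channels into heads so different heads can specialize,
        # e.g. one attending to speech and another to ambient sound.
        a = audio_feats.reshape(B, num_heads, d, T)
        v = visual_feats.reshape(B, num_heads, d, H * W)

        # Full pairwise similarity volume: every audio clip vs. every
        # image, every timestep vs. every spatial location, per head.
        # sim: (B_audio, B_image, heads, T, H*W)
        sim = torch.einsum('bkdt,ckds->bckts', a, v)

        # Collapse the dense volume to one score per (clip, image) pair:
        # max over space localizes, mean over time and heads summarizes.
        sim = sim.max(dim=-1).values   # (B, B, heads, T)
        sim = sim.mean(dim=-1)         # (B, B, heads)
        return sim.mean(dim=-1)        # (B, B)

    # Usage: standard cross-entropy over the similarity matrix, with
    # matching clip/image pairs on the diagonal as positives.
    B, C, T, H, W = 4, 64, 10, 14, 14
    audio = torch.randn(B, C, T)
    video = torch.randn(B, C, H, W)
    logits = multihead_dense_similarity(audio, video, num_heads=2)
    loss = F.cross_entropy(logits / 0.07, torch.arange(B))

In a sketch like this, max-pooling over spatial locations is what permits localization: the clip-level score is driven by the patch that best matches each audio frame, so the contrastive gradient concentrates on the matching region instead of being smeared over a global embedding.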

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | Latest Papers | HyperAI