V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities, learnt from vast amounts of data, can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when the audio modality is involved. At the same time, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent spaces of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluations on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches: it is trained with 86% fewer parameters yet achieves 53% and 19% improvements in FD and CS, respectively.
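To make the regression-mapper idea concrete, the following is a minimal sketch under loud assumptions, not the V2A-Mapper architecture itself, which the abstract does not specify. It fits a plain linear map with a mean-squared-error objective on synthetic paired vectors that merely stand in for CLIP (visual) and CLAP (audio) embeddings. No real foundation model is loaded, the dimensions are toy values, and every name (`train_regression_mapper`, `make_pair`, `hidden_map`) is hypothetical. Cosine similarity is used only as a simple alignment check and is not the CS metric reported above.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_regression_mapper(pairs, dim_in, dim_out, lr=0.5, epochs=100):
    """Fit a linear map W by per-sample gradient descent on squared error.

    ASSUMPTION: a linear map is a stand-in chosen for brevity; the paper's
    mapper may be a different (and more expressive) model.
    """
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x, y in pairs:
            pred = matvec(W, x)
            for i in range(dim_out):
                err = pred[i] - y[i]
                for j in range(dim_in):
                    W[i][j] -= lr * err * x[j]
    return W


# Synthetic stand-ins: "visual" vectors of size 4, "audio" vectors of size 3.
rng = random.Random(0)
DIM_V, DIM_A = 4, 3

# Hypothetical ground-truth relation between the two toy spaces.
hidden_map = [[rng.gauss(0, 1) for _ in range(DIM_V)] for _ in range(DIM_A)]


def make_pair():
    """Return one (visual, audio) toy embedding pair."""
    x = normalize([rng.gauss(0, 1) for _ in range(DIM_V)])
    return x, matvec(hidden_map, x)


train_pairs = [make_pair() for _ in range(32)]
test_pairs = [make_pair() for _ in range(8)]

W = train_regression_mapper(train_pairs, DIM_V, DIM_A)

# Alignment between translated "visual" vectors and their "audio" targets.
mean_cs = sum(cosine(matvec(W, x), y) for x, y in test_pairs) / len(test_pairs)
print(f"mean cosine similarity on held-out pairs: {mean_cs:.4f}")
```

In a full pipeline, the translated vector would condition a pretrained audio generator in place of the toy cosine check. A generative mapper would instead sample from a learned conditional distribution over the target space rather than predict a single point, which is one plausible reading of why it fares better on variability.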