Tell What You Hear From What You See -- Video to Audio Generation Through Text

The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio tracks and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input and generates audio together with an optional textual description of that audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for a video by producing audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer mapping video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and an optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that when VATT is compared to existing video-to-audio generation methods on objective metrics, it achieves competitive performance when no audio caption is provided; when an audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT is preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text, as well as suggesting text prompts for videos through audio captioning, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
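
To make the two-module design concrete, the sketch below illustrates the general pattern the abstract describes: a projection layer that maps video features into an LLM's embedding space, and a masked transformer that fills in discrete audio-codec tokens over several iterative parallel-decoding steps (a neural codec would then turn those tokens into a waveform). This is a minimal sketch under assumed settings; the class names (VideoToLLMProjector, ParallelAudioDecoder), layer sizes, and decoding schedule are illustrative assumptions and are not taken from the VATT paper or its code.

```python
# Illustrative sketch only: sizes, names, and the decoding schedule are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class VideoToLLMProjector(nn.Module):
    """Maps per-frame video features into the LLM token-embedding space."""

    def __init__(self, video_dim=768, llm_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats):           # (B, T_frames, video_dim)
        return self.proj(video_feats)         # (B, T_frames, llm_dim): "visual tokens"


class ParallelAudioDecoder(nn.Module):
    """Masked transformer that predicts discrete audio tokens in parallel."""

    def __init__(self, vocab_size=1024, dim=512, depth=6, heads=8):
        super().__init__()
        self.mask_id = vocab_size                            # extra index = [MASK]
        self.token_emb = nn.Embedding(vocab_size + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, audio_tokens, context):
        # Prepend conditioning (projected video / text embeddings) to the audio tokens.
        x = torch.cat([context, self.token_emb(audio_tokens)], dim=1)
        h = self.backbone(x)[:, context.size(1):]            # keep audio positions only
        return self.to_logits(h)                             # (B, T_audio, vocab_size)

    @torch.no_grad()
    def generate(self, context, seq_len=256, steps=8):
        """Iterative parallel decoding: start fully masked and, at each step,
        commit a growing fraction of the most confident predictions."""
        B = context.size(0)
        tokens = torch.full((B, seq_len), self.mask_id, device=context.device)
        for step in range(steps):
            probs = self.forward(tokens, context).softmax(-1)
            conf, pred = probs.max(-1)                       # per-position confidence
            keep = int(seq_len * (step + 1) / steps)         # fill schedule
            idx = conf.topk(keep, dim=-1).indices
            tokens = torch.full_like(tokens, self.mask_id)   # re-mask the rest
            tokens.scatter_(1, idx, pred.gather(1, idx))
        return tokens                                        # all positions filled at the last step


if __name__ == "__main__":
    projector = VideoToLLMProjector(video_dim=768, llm_dim=512)  # llm_dim shrunk for the demo
    decoder = ParallelAudioDecoder(dim=512)
    video_feats = torch.randn(1, 16, 768)          # 16 frames of visual features
    context = projector(video_feats)               # pseudo "visual tokens" as conditioning
    audio_tokens = decoder.generate(context, seq_len=64, steps=4)
    print(audio_tokens.shape)                      # torch.Size([1, 64]); a pretrained neural
                                                   # codec would decode these tokens to audio
```

The sketch keeps the two roles separate on purpose: the projector only bridges feature spaces, while all cross-modal conditioning and the parallel refinement of audio tokens happen inside the decoder, mirroring the split between VATT Converter and VATT Audio described above.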