From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

Video encompasses both visual and auditory data, creating a perceptually rich experience in which the two modalities complement each other. As such, videos are a valuable type of media for investigating the interplay between audio and visual elements. Previous studies of audio-visual modalities have primarily focused on either audio-visual representation learning or generative modeling of one modality conditioned on the other, creating a disconnect between these two branches; a unified framework that both learns representations and generates modalities has not yet been developed. In this work, we introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation. The key idea of VAB is to perform representation learning and generative modeling in latent spaces rather than on raw video frames and audio. Specifically, VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively, and is then pre-trained on visual-conditioned masked audio token prediction. This training strategy enables the model to engage in contextual learning while simultaneously supporting video-to-audio generation. After the pre-training phase, VAB employs an iterative-decoding approach to rapidly generate audio tokens conditioned on visual features. Since VAB is a unified model, its backbone can be fine-tuned for various audio-visual downstream tasks. Our experiments showcase the efficiency of VAB in producing high-quality audio from video and its capability to acquire semantic audio-visual features, leading to competitive results in audio-visual retrieval and classification.
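
To make the pre-training objective and the decoding scheme concrete, the following is a minimal PyTorch sketch of visual-conditioned masked audio token prediction with confidence-based iterative decoding. It is an illustration only, not the authors' implementation: the names (VisualConditionedMaskedAudioModel, masked_prediction_loss, iterative_decode), the hyperparameters, and the MaskGIT-style schedule of revealing the most confident tokens at each step are assumptions made for this example.

import torch
import torch.nn as nn

class VisualConditionedMaskedAudioModel(nn.Module):
    # Illustrative sketch: a bidirectional transformer that predicts masked
    # audio tokens conditioned on projected visual features (hypothetical
    # names and sizes, not the VAB authors' code).
    def __init__(self, vocab_size=1024, dim=256, n_heads=4, n_layers=4,
                 visual_dim=512, max_len=512):
        super().__init__()
        self.mask_id = vocab_size                      # reserve one extra id for [MASK]
        self.token_emb = nn.Embedding(vocab_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        self.visual_proj = nn.Linear(visual_dim, dim)  # map visual features into token space
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, T) token ids, some replaced by self.mask_id
        # visual_feats: (B, V, visual_dim) frame-level visual features
        x = self.token_emb(audio_tokens) + self.pos_emb[:, :audio_tokens.size(1)]
        v = self.visual_proj(visual_feats)
        h = self.encoder(torch.cat([v, x], dim=1))     # visual tokens prepended as conditioning
        return self.head(h[:, v.size(1):])             # logits over audio positions only

def masked_prediction_loss(model, audio_tokens, visual_feats, mask_ratio=0.5):
    # Pre-training objective: randomly mask audio tokens and predict them
    # from the visible tokens plus the visual condition.
    mask = torch.rand_like(audio_tokens, dtype=torch.float) < mask_ratio
    corrupted = audio_tokens.masked_fill(mask, model.mask_id)
    logits = model(corrupted, visual_feats)
    return nn.functional.cross_entropy(logits[mask], audio_tokens[mask])

@torch.no_grad()
def iterative_decode(model, visual_feats, seq_len=128, steps=8):
    # Parallel iterative decoding: start fully masked, and at every step keep
    # the most confident predictions and re-predict the rest (a MaskGIT-style
    # schedule, assumed here for illustration).
    B = visual_feats.size(0)
    tokens = torch.full((B, seq_len), model.mask_id, dtype=torch.long)
    revealed = 0
    for step in range(1, steps + 1):
        logits = model(tokens, visual_feats)                  # (B, T, vocab)
        conf, preds = logits.softmax(-1).max(-1)              # per-position confidence / argmax
        conf = conf.masked_fill(tokens != model.mask_id, -1)  # fixed slots cannot be re-chosen
        target = seq_len * step // steps                      # cumulative tokens revealed so far
        n_new = target - revealed
        if n_new > 0:
            idx = conf.topk(n_new, dim=-1).indices            # most confident masked positions
            tokens.scatter_(1, idx, preds.gather(1, idx))
            revealed = target
    return tokens

# Toy usage: 2 "videos" with 16 visual feature vectors each.
model = VisualConditionedMaskedAudioModel()
visual = torch.randn(2, 16, 512)
generated = iterative_decode(model, visual, seq_len=64, steps=8)  # (2, 64) audio token ids

Because the model is trained to fill in arbitrary subsets of masked audio tokens given visual features, the same backbone serves both as a representation learner (via its intermediate features) and as a conditional generator (via the fully masked decoding loop above); the number of decoding steps, rather than the sequence length, governs generation latency.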