Temporally Aligned Audio for Video with Autoregression

We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound, and models are available at https://v-aura.notion.site
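As a rough illustration of what a cross-modal audio-visual feature fusion could look like, the PyTorch sketch below projects visual features and audio token embeddings into a shared dimension and interleaves them per time step, so an autoregressive decoder can condition each audio step on time-aligned visual context. All names, dimensions, and the interleaving scheme are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical fusion module: temporally interleaves visual
    features with audio token embeddings.

    Dimensions and the per-step interleaving are illustrative
    assumptions; V-AURA's actual fusion details may differ.
    """

    def __init__(self, visual_dim: int = 768, audio_dim: int = 512,
                 model_dim: int = 1024):
        super().__init__()
        # Project both modalities into a shared model dimension.
        self.visual_proj = nn.Linear(visual_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T, visual_dim) -- high-framerate visual features
        # audio:  (B, T, audio_dim)  -- audio token embeddings,
        #         resampled to the same temporal rate as the visuals
        v = self.visual_proj(visual)  # (B, T, model_dim)
        a = self.audio_proj(audio)    # (B, T, model_dim)
        # Interleave per time step: [v_1, a_1, v_2, a_2, ...],
        # keeping the two modalities temporally aligned.
        fused = torch.stack((v, a), dim=2)  # (B, T, 2, model_dim)
        return fused.flatten(1, 2)          # (B, 2T, model_dim)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    visual = torch.randn(2, 50, 768)  # e.g. 2 s of video at 25 fps
    audio = torch.randn(2, 50, 512)
    print(fusion(visual, audio).shape)  # torch.Size([2, 100, 1024])
```

The interleaved sequence keeps visual and audio information adjacent at each time step, which is one plausible way to give an autoregressive model the fine-grained temporal conditioning the abstract describes.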