Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even a single one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
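
The straight-path regression and ODE sampling described above can be illustrated with a minimal toy sketch. This is not Frieren's implementation: the function names (straight_path, target_velocity, euler_sample, oracle_field) are hypothetical, the "latent" is a 4-dimensional vector, and a closed-form field for a single data point stands in for the transformer estimator conditioned on video features. It assumes the common rectified-flow convention in which t = 0 is noise and t = 1 is data.

```python
import numpy as np

def straight_path(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the vector field along a straight path."""
    return x1 - x0

def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so the toy field is well defined
        x = x + dt * vector_field(x, t)
    return x

# Toy setup (assumption): the data distribution is a single point x1,
# so the ideal vector field has the closed form (x1 - x) / (1 - t).
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)            # noise sample
x1 = np.array([1.0, -2.0, 0.5, 3.0])   # stand-in for a spectrogram latent

def oracle_field(x, t):
    return (x1 - x) / (1.0 - t)

one_step = euler_sample(oracle_field, x0, num_steps=1)
four_steps = euler_sample(oracle_field, x0, num_steps=4)
```

In this toy case the trajectory is exactly straight, so a single Euler step already lands on x1. A learned field generally has curved trajectories, which is the gap that reflow and one-step distillation are meant to narrow.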