Read, Watch and Scream! Sound Generation from Text and Video

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods produce high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called \ours, where video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient than training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, our method becomes a more flexible system that lets users freely adjust the energy, the surrounding environment, and the primary sound source according to their preferences. Experimental results demonstrate that our method is superior in terms of quality, controllability, and training efficiency. Code and demo are available at https://naver-ai.github.io/rewas.
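To make the described control flow concrete, below is a minimal sketch (not the authors' implementation) of the idea stated in the abstract: a video-derived energy envelope provides the structural, time-varying condition, while the text prompt provides the content. The names `EnergyPredictor`, `generate_sound`, and `text_to_audio_model` are hypothetical placeholders; the actual video encoder, energy representation, and conditioning mechanism are assumptions here.

```python
# Minimal sketch of video-as-control for a text-to-audio model (hypothetical names).
import torch
import torch.nn as nn


class EnergyPredictor(nn.Module):
    """Maps per-frame video features to a 1-D energy (loudness) envelope over time."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_frames, feat_dim) -> energy: (batch, num_frames)
        return self.net(video_feats).squeeze(-1)


def generate_sound(video_feats, prompt, energy_predictor, text_to_audio_model):
    """Video supplies timing/intensity via the energy envelope; the prompt supplies content."""
    energy = energy_predictor(video_feats)  # structural control estimated from video
    # The text-to-audio backbone is assumed to accept an extra temporal condition.
    return text_to_audio_model(prompt=prompt, energy=energy)
```

In use, a frozen or pretrained text-to-audio model would be paired with the lightweight energy predictor, so only the video-to-energy mapping needs to be learned rather than a full multimodal generator trained on triplet-paired data.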