A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or on videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out of sync with the new audio. We identify the key reasons for this failure and resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as that of real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model and evaluation benchmarks on our website: \url{cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild}. The code and models are released at this GitHub repository: \url{github.com/Rudrabha/Wav2Lip}. You can also try out the interactive demo at this link: \url{bhaasha.iiit.ac.in/lipsync}.
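The central idea stated above, supervising the generator with a powerful pre-trained lip-sync discriminator, can be illustrated with a minimal sketch. The sketch below is not the Wav2Lip implementation; the module names (SyncExpert, expert_sync_loss), input shapes, and hyper-parameters are illustrative assumptions. It only shows the general pattern of keeping an audio-visual sync "expert" frozen and penalizing generated mouth frames whose embedding disagrees with the target audio's embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SyncExpert(nn.Module):
    """Toy stand-in for a pre-trained lip-sync discriminator.

    It embeds a short window of mouth crops and the matching audio
    window into a shared space; agreement between the two embeddings
    is read as a measure of lip-sync. Shapes are assumptions.
    """

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Flattened 5-frame mouth window (5 x 3 x 48 x 96) -> embedding.
        self.face_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(5 * 3 * 48 * 96, embed_dim))
        # Flattened mel-spectrogram window (80 x 16) -> embedding.
        self.audio_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(80 * 16, embed_dim))

    def forward(self, frames: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.face_encoder(frames), dim=1)
        a = F.normalize(self.audio_encoder(mel), dim=1)
        # Cosine similarity between face and audio embeddings, in [-1, 1].
        return (v * a).sum(dim=1)


def expert_sync_loss(expert: SyncExpert,
                     generated_frames: torch.Tensor,
                     mel: torch.Tensor) -> torch.Tensor:
    """Loss that pushes generated frames to be judged in-sync by the expert."""
    sim = expert(generated_frames, mel)      # cosine similarity in [-1, 1]
    prob = (sim + 1.0) / 2.0                 # map to [0, 1] as a sync "probability"
    # The generator is trained to make the expert output "in sync" (label 1).
    return F.binary_cross_entropy(prob.clamp(1e-7, 1 - 1e-7),
                                  torch.ones_like(prob))


if __name__ == "__main__":
    expert = SyncExpert()
    # The expert stays frozen: its gradients flow into the generator's
    # frames, but its own weights are never updated.
    for p in expert.parameters():
        p.requires_grad_(False)

    fake_frames = torch.rand(4, 5, 3, 48, 96)  # batch of generated mouth windows
    mel = torch.rand(4, 80, 16)                # matching audio windows
    loss = expert_sync_loss(expert, fake_frames, mel)
    print(loss.item())
```

In training, this sync loss would be added to the generator's reconstruction (and optionally visual-quality GAN) objectives; since the expert is kept frozen, the generator cannot degrade the sync judge and must instead produce genuinely better-synced lips.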