HyperAI

Free the Animators! Wav2Lip Uses AI to Listen to Audio and Sync a Character's Lip Movements


"seeing is believing"AI technology has become ineffective. Face-changing and lip-syncing technologies are emerging one after another, and the effects are becoming more and more realistic. Wav2Lip The model only needs an original video and a target audio to combine them into one.

In recent years, Hollywood animated films such as Zootopia and Frozen have repeatedly grossed more than $1 billion at the box office, and their production quality is superb. Take lip movements alone: they are meticulously crafted, and the animated characters' mouths move almost exactly like real people's.

Achieving this requires a very complicated production pipeline and enormous labor and resources, which is why many studios settle for relatively simple lip animation to save costs.

Now, AI is stepping in to lighten the animators' workload. This year, a team from IIIT Hyderabad in India and the University of Bath in the UK published a paper at ACM MM 2020, "A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild", proposing an AI model called Wav2Lip. Given just a video of a person and a target speech track, it merges the two so that lips and audio fit together seamlessly.

Wav2Lip: a lip-syncing model with outstanding results

Lip-syncing technology is not new. Even before deep learning, there were methods that matched a character's lip shape to an actual speech signal.

Wav2Lip, however, clearly outperforms the alternatives. Most existing methods generate lip-synced video from static images, and they often fail on dynamic, already-speaking faces.

Wav2Lip can re-sync the lips in a dynamic video directly, outputting a result that matches the target audio.

It also works on more than live-action video: animated characters can be lip-synced too, which opens up plenty of possibilities for meme GIFs and sticker packs.

The author selected some animated images to match the lip movements

In human evaluations, videos generated by Wav2Lip were preferred over those from existing methods more than 90% of the time.

How well does the model work? Super Neuro ran a few tests. The video below shows the official demo in action; the inputs are the test materials provided by the project, plus Chinese and English clips chosen by Super Neuro.

The characters in the original video input are not speaking

Through AI model operation, the character's lip shape is synchronized with the input voice

The result in the official demo animation looks flawless. In Super Neuro's live-action test, aside from slight deformation and jitter around the lips, the overall lip sync is still fairly accurate.

The tutorial is out: learn it in three minutes

After reading this, are you eager to try it? If you already have a bold idea, why not start now?

The project is open source on GitHub, and the authors provide an interactive demo, a Colab notebook, complete training and inference code, pre-trained models, and tutorials.

The project details are as follows:

Project name: Wav2Lip

GitHub address:

https://github.com/Rudrabha/Wav2Lip

Project operating environment:

  • Language: Python 3.6+
  • Video processing program: ffmpeg

Download the pre-trained face detection model:

https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth
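
For a local setup, the downloaded face detection weights need to sit where the repo's face_detection module expects them; the path below follows the convention in the Wav2Lip README (face_detection/detection/sfd/s3fd.pth), so double-check it against the repository you clone. A minimal sketch using only the Python standard library:

    # fetch_s3fd.py -- illustrative download helper, not part of the Wav2Lip repo
    import os
    import urllib.request

    URL = "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth"
    # Destination follows the Wav2Lip README's convention; verify against the current repo.
    DEST = os.path.join("face_detection", "detection", "sfd", "s3fd.pth")

    os.makedirs(os.path.dirname(DEST), exist_ok=True)
    urllib.request.urlretrieve(URL, DEST)
    print("Saved face detection weights to", DEST)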

In addition to the environment above, you also need to download and install the following Python packages (a pip install sketch follows the list):

  • librosa==0.7.0
  • numpy==1.17.1
  • opencv-contrib-python>=4.2.0.34
  • opencv-python==4.1.0.25
  • tensorflow==1.12.0
  • torch==1.1.0
  • torchvision==0.3.0
  • tqdm==4.45.0
  • numba==0.48
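
If you are installing these by hand, here is a minimal sketch of pinning them with pip from Python; the version list simply mirrors the bullets above (the repository also ships a requirements.txt you can install from directly), and the helper script itself is illustrative, not part of the project.

    # install_deps.py -- illustrative helper, not part of the Wav2Lip repo
    import subprocess
    import sys

    # Pinned versions copied from the package list above.
    PACKAGES = [
        "librosa==0.7.0",
        "numpy==1.17.1",
        "opencv-contrib-python>=4.2.0.34",
        "opencv-python==4.1.0.25",
        "tensorflow==1.12.0",
        "torch==1.1.0",
        "torchvision==0.3.0",
        "tqdm==4.45.0",
        "numba==0.48",
    ]

    # Use the current interpreter's pip so packages land in the active environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", *PACKAGES])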

However, you can skip this cumbersome setup entirely. All you need is a picture or video of a person (a CGI character works too) plus an audio clip (synthesized audio works too), and you can run the model with one click on OpenBayes, a China-based machine learning compute container platform.

Portal: https://openbayes.com/console/openbayes/containers/EiBlCZyh7k7

Currently, the platform also offers free vGPU usage time every week, so everyone can easily complete the tutorial.

The entire tutorial only takes 3-5 minutes to complete

The project ships three sets of weights: Wav2Lip, Wav2Lip + GAN, and an expert discriminator. The latter two give noticeably better results than the plain Wav2Lip weights; this tutorial uses the Wav2Lip + GAN weights.
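
For readers running the repository locally instead of the hosted container, generation is driven by the repo's inference.py script. The sketch below simply shells out to it with the Wav2Lip + GAN checkpoint; all file paths are placeholders to replace with your own, and the flag names should be checked against the current README.

    # run_inference.py -- illustrative wrapper around the repo's inference.py
    import subprocess
    import sys

    # Placeholder paths: point these at your own checkpoint, face video, and audio clip.
    CHECKPOINT = "checkpoints/wav2lip_gan.pth"   # the Wav2Lip + GAN weights used in this tutorial
    FACE_VIDEO = "inputs/speaker.mp4"            # video (or image) of the person
    AUDIO = "inputs/target_speech.wav"           # target speech to lip-sync to

    subprocess.check_call([
        sys.executable, "inference.py",
        "--checkpoint_path", CHECKPOINT,
        "--face", FACE_VIDEO,
        "--audio", AUDIO,
    ])
    # inference.py writes the synced video to its default output path (see its --outfile option).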

The authors emphasize that any results produced with the open-source code may be used only for research, academic, or personal purposes. Because the model is trained on the LRS2 (Lip Reading Sentences 2) dataset, any form of commercial use is strictly prohibited.

To prevent abuse of the technology, the researchers also strongly recommend that any content created with Wav2Lip's code and models be clearly marked as synthetic.

The key technology behind it: lip sync discriminator

How does Wav2Lip listen to audio and lip-sync so accurately?

The key to the breakthrough is that the researchers use a lip-sync discriminator, which forces the generator to keep producing accurate, realistic lip movements.

In addition, the study improves visual quality by feeding the discriminator multiple consecutive frames rather than a single frame, and by using a visual quality loss (rather than only a contrastive loss) to account for temporal correlation.

The authors use a pre-trained discriminator that detects lip-sync errors quickly and accurately.
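
Conceptually, the expert discriminator scores how well a short window of mouth frames matches the corresponding slice of audio, and the generator is penalised whenever that score is low. The PyTorch sketch below is a simplified, self-contained illustration of that idea (audio and face embeddings compared by cosine similarity, turned into a sync probability and pushed up with a log loss); the tiny encoders and tensor shapes are placeholders, not the paper's actual SyncNet-style architecture.

    # sync_loss_sketch.py -- simplified illustration of an expert lip-sync loss
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinySyncNet(nn.Module):
        # Toy stand-in for a pre-trained sync expert (not the real architecture).
        def __init__(self, face_dim=5 * 3 * 48 * 96, audio_dim=80 * 16, embed_dim=128):
            super().__init__()
            # Placeholder encoders: flatten the inputs and project them into a shared space.
            self.face_encoder = nn.Linear(face_dim, embed_dim)
            self.audio_encoder = nn.Linear(audio_dim, embed_dim)

        def forward(self, mouth_frames, mel_chunk):
            v = F.normalize(self.face_encoder(mouth_frames.flatten(1)), dim=1)
            a = F.normalize(self.audio_encoder(mel_chunk.flatten(1)), dim=1)
            # Cosine similarity mapped to (0, 1): a "probability" that lips and audio are in sync.
            return (F.cosine_similarity(v, a, dim=1) + 1.0) / 2.0

    def sync_loss(sync_prob, eps=1e-7):
        # The generator is trained to maximise the expert's sync probability.
        return -torch.log(sync_prob + eps).mean()

    # Dummy example: 5 consecutive lower-face crops per sample and the matching mel chunk.
    frames = torch.randn(2, 5 * 3, 48, 96)   # placeholder shape for a batch of mouth crops
    mel = torch.randn(2, 1, 80, 16)          # placeholder shape for matching mel-spectrogram windows
    expert = TinySyncNet()
    loss = sync_loss(expert(frames, mel))
    print(loss.item())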

According to the researchers, Wav2Lip is nearly universal: it works for any identity, any voice, and any language, and achieves high accuracy on arbitrary videos. The generated lips blend seamlessly into the original footage, animated faces can be converted as well, and synthesized speech can be used as input.

It is easy to imagine this tool setting off another wave of meme remix (guichu) videos...

Paper address:

https://arxiv.org/pdf/2008.10010.pdf

Demo address:

https://bhaasha.iiit.ac.in/lipsync/

-- End --