Video Face Manipulation Detection Through Ensemble of CNNs

In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (e.g., FaceSwap, deepfake). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a significantly bad impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). Objectively detecting whether a face has been manipulated in a video sequence is therefore a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119,000 videos.
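
As an illustration of what ensembling several trained detectors can look like, the following minimal Python sketch combines per-frame scores from multiple models into a single video-level score. The abstract does not state how the networks' outputs are combined, so the averaging rule here is an assumption made for illustration only, and the function name `ensemble_video_score` and its input layout are hypothetical.

```python
import math
from statistics import mean


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_video_score(frame_logits_per_model):
    """Combine per-frame logits from several models into one video score.

    frame_logits_per_model: one list per model, each holding that model's
    raw score for every sampled face frame of the same video.
    ASSUMPTION: scores are averaged across models, then across frames;
    this rule is illustrative and not taken from the paper.
    """
    if not frame_logits_per_model:
        raise ValueError("at least one model is required")
    n_frames = len(frame_logits_per_model[0])
    if n_frames == 0:
        raise ValueError("at least one frame is required")
    if any(len(m) != n_frames for m in frame_logits_per_model):
        raise ValueError("all models must score the same frames")

    # Average the models' logits frame by frame.
    per_frame = [
        mean(model[i] for model in frame_logits_per_model)
        for i in range(n_frames)
    ]
    # Average over frames and squash to a probability of manipulation.
    return sigmoid(mean(per_frame))
```

In this sketch a score above 0.5 would be read as "manipulated", although the threshold is also an assumption; in practice it would be tuned on validation data.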