HyperAI

Dataset Roundup | DeepFake Chaos: Use Magic to Defeat Magic! High-Quality Datasets Advance Forgery Detection Technology


With the rapid development of artificial intelligence, facial recognition has been widely adopted in security, payment, retail, and other fields, greatly improving the convenience and safety of daily life. However, the technology is proving to be a double-edged sword, especially where privacy protection is concerned, and its abuse has become a focus of public attention.

According to CCTV's 315 Gala, many well-known companies illegally collected and stored facial information without consumers' consent and generated unique IDs for subsequent business analysis and precision marketing. This behavior seriously violated consumers' privacy rights and aroused widespread public concern.

At the same time, DeepFake, an AI-driven deep forgery technology, is passing off the fake as real, disrupting social order and harming the public interest. Trained on massive amounts of data, it can generate large numbers of fake photos, videos, and audio clips, and its face swaps are so sophisticated that ordinary people can hardly spot the subtle differences. Many criminals use this technology for illegal profit; in South Korea alone, the number of such offenders is reported to be as high as 220,000.

At the technical level, therefore, continuously upgrading face recognition and forgery detection so that tampered DeepFake videos and images can be identified accurately is an urgent problem. This article summarizes commonly used face recognition and DeepFake datasets, in the hope of helping researchers work more effectively in these fields.

Click to view more open source datasets:

https://go.hyper.ai/jpfrj

DeepFake/Face Recognition Dataset

1. Deepfake Detection Video Recognition Dataset

Publishing Platform: Kaggle

Release time: 2024

Estimated size: 22.5 GB

Download address: https://go.hyper.ai/B8dJf

The Deepfake Detection dataset is designed for the deepfake detection task and provides a comprehensive collection of video sequences for training and evaluating deep learning models that identify manipulated media. The data were downloaded from the official FaceForensics server, which specializes in providing high-quality datasets for face manipulation detection.

2. LAV-DF Multimodal Audio-Visual Dataset

Publishing Agency: Monash University, Curtin University, Indian Institute of Technology Ropar

Release time: 2022

Estimated size: 23.11 GB

Download address: https://go.hyper.ai/wTcYE

LAV-DF is a multimodal (video tampering and audio tampering) dataset derived from the VoxCeleb2 dataset, containing 136,304 videos, including 36,431 real videos and 99,873 fake videos.
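Because fake videos outnumber real ones by almost three to one, a detector trained on LAV-DF may benefit from class reweighting. The following is a minimal sketch, not part of the dataset's tooling: it uses only the counts quoted above, and the helper name `inverse_frequency_weights` and the inverse-frequency scheme itself are illustrative assumptions.

```python
# Counts quoted in the LAV-DF description above.
REAL_VIDEOS = 36_431
FAKE_VIDEOS = 99_873


def inverse_frequency_weights(counts):
    """Hypothetical helper: weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


counts = {"real": REAL_VIDEOS, "fake": FAKE_VIDEOS}
total = sum(counts.values())
fake_share = FAKE_VIDEOS / total
weights = inverse_frequency_weights(counts)

print(f"total videos: {total}")
print(f"fake share:   {fake_share:.1%}")
for label, weight in weights.items():
    print(f"weight[{label}] = {weight:.3f}")
```

The rarer "real" class receives the larger weight, which is one common way to keep a classifier from simply predicting "fake" for everything.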

3. OpenForensics face forgery detection dataset

Publishing Agency: National Institute of Informatics, Japan; The Graduate University for Advanced Studies (SOKENDAI), Japan; University of Tokyo

Release time: 2021

Download address: https://go.hyper.ai/64Gn2

OpenForensics is a large-scale, challenging dataset designed for multi-face forgery detection and segmentation. It consists of 115K in-the-wild images containing 334K faces, all with rich facial annotations. Beyond multi-face forgery detection and segmentation, it also supports general face-related tasks, giving it great potential for research on deepfake prevention and general face detection.

4. ForgeryNet face forgery dataset

Publishing Agency: SenseTime Research, Beijing University of Posts and Telecommunications, Shanghai Artificial Intelligence Laboratory, School of Software, Beihang University, University of Science and Technology of China, S-Lab, Nanyang Technological University

Release time: 2021

Download address: https://go.hyper.ai/h9fii

ForgeryNet is a large, comprehensive benchmark built specifically for deepfake analysis. It contains 2.9 million images and 221,247 videos, covering 7 image-level and 8 video-level forgery manipulation methods from around the world. It supports 4 tasks across the image and video levels: image forgery classification, spatial forgery localization, video forgery classification, and temporal forgery localization.

5. FFIW10K face forgery dataset

Publishing Agency: Computer Vision Laboratory, ETH Zurich, Institute of Artificial Intelligence, Beihang University, University of Technology Sydney

Release time: 2021

Download address: https://go.hyper.ai/rstji

The dataset includes 10,000 high-quality fake videos collected from YouTube, with an average of 3 faces per frame. Each video contains both real and fake faces, which brings it closer to complex real-world scenes. The manipulation process is fully automatic and controlled by a domain-adversarial quality assessment network, so the dataset is highly scalable and requires little manual labor.

6. Human Faces Dataset

Publishing Platform: Kaggle

Release time: 2024

Estimated size: 113.93 MB

Download address: https://go.hyper.ai/Ewakl

The dataset contains approximately 9.6K face images: 5K real face images and 4.63K AI-generated face images.

7. Glint360K face recognition dataset

Publishing Agency: DeepGlint

Release time: 2021

Estimated size: 161.46 GB

Download address: https://go.hyper.ai/j0rrB

The dataset consists of approximately 17 million face images covering approximately 360,000 identities. At the time of its release it was described as the largest and cleanest face recognition dataset available. It is designed for training and evaluating large-scale face recognition models and is widely used in face recognition research and development, especially with deep learning methods.

8. FaceForensics face forgery detection dataset

Publishing Agency: Technical University of Munich (TUM)

Release time: 2020

Download address: https://go.hyper.ai/ItO9I

The dataset contains a large number of synthetic and real-world face manipulations drawn from different YouTube videos and covering multiple selected video creators. With this dataset, researchers can develop more accurate and reliable methods for detecting and identifying fake face images and videos.

9. UTKFace Large-Scale Face Recognition Dataset

Publishing Agency: University of Tennessee, Knoxville, USA

Release time: 2017

Estimated size: 1.45 GB

Download address: https://go.hyper.ai/8soAU

The UTKFace dataset is a large-scale face dataset with a long age span (0 to 116 years old), containing more than 20,000 facial images annotated with age, gender, and race. The subjects vary greatly in pose, facial expression, lighting, occlusion, resolution, and so on, and the dataset can be used for tasks such as face recognition, age estimation, age progression prediction, and landmark localization.
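The age, gender, and race annotations are commonly read straight from the image filenames. The sketch below assumes the `[age]_[gender]_[race]_[date&time].jpg` naming convention used in the public UTKFace release; that convention is not stated in this article, the `UTKLabel` type and `parse_utkface_name` helper are hypothetical, and the sample filename is made up for illustration.

```python
from typing import NamedTuple


class UTKLabel(NamedTuple):
    """Hypothetical container for the three annotations."""
    age: int
    gender: int  # integer code as stored in the filename
    race: int    # integer code as stored in the filename


def parse_utkface_name(filename: str) -> UTKLabel:
    """Parse a name assumed to look like '<age>_<gender>_<race>_<timestamp>.jpg'."""
    stem = filename.split(".")[0]   # drop '.jpg' and any extra suffixes
    parts = stem.split("_")
    if len(parts) < 4:
        # Reject names that are missing a field instead of guessing.
        raise ValueError(f"unexpected UTKFace filename: {filename!r}")
    age, gender, race = (int(p) for p in parts[:3])
    return UTKLabel(age, gender, race)


# Illustrative filename only, not taken from the dataset.
label = parse_utkface_name("25_0_2_20170116174525125.jpg.chip.jpg")
print(label)
```

Raising an error on malformed names, rather than silently filling in a default, keeps mislabeled samples out of the training set.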

10. CelebA face attribute dataset

Publishing Agency: The Chinese University of Hong Kong

Release time: 2015

Estimated size: 16.92 GB

Download address: https://go.hyper.ai/l0j1L

The CelebFaces Attributes (CelebA) dataset is a large-scale face attribute dataset with more than 200K celebrity images, each annotated with 40 attributes and covering a wide range of poses and backgrounds. In total, CelebA includes 10,177 identities and 202,599 facial images, with 5 landmark locations annotated per image.

11. VGG-Face2 face recognition dataset

Publishing Agency: University of Oxford

Release time: 2017

Estimated size: 37.49 GB

Download address: https://go.hyper.ai/XKI0Z

The VGG-Face2 dataset is a face image dataset containing facial data for 9,131 people in total, with all images collected from Google Image Search. The people in the dataset vary widely in pose, age, race, and occupation.

These are the 11 face recognition and DeepFake datasets compiled by HyperAI. If you have resources you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit a contribution!

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1,200+ public datasets

* Included 300+ classic and popular online tutorials

* Interpreted 100+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Chinese documentation for Apache TVM in China

Visit the official website to start your learning journey:

https://hyper.ai

Finally, we would like to recommend an academic sharing event!

The third Meet AI4S livestream features Zhou Ziyi, a postdoctoral fellow at the Institute of Natural Sciences, Shanghai Jiao Tong University, and the Shanghai National Center for Applied Mathematics. Click here to reserve a spot for the livestream!