Microsoft Deletes a Well-Known Dataset, Putting Data Privacy Under the Spotlight

A few days ago, Microsoft quietly deleted a public celebrity image dataset. The dataset covers 100,000 celebrities' facial images and is often used for face recognition training. We don't know the real reason Microsoft deleted it, but the data privacy issues involved, including the security norms of face recognition technology, are worth pondering.
Microsoft removed a celebrity image dataset last week. The dataset, the world's largest public facial recognition dataset, is no longer accessible through Microsoft's channels.
What lies behind this "silent" deletion?
The trouble Microsoft wants to leave behind: the MS Celeb celebrity dataset
The MS Celeb 1M dataset, first released by Microsoft in 2016, covers 100,000 celebrities with nearly 10 million facial images, all collected from the Internet.
According to Microsoft's description, the 100,000 celebrities were selected from 1 million names on the Internet based on popularity, and a search engine was then used to pull about 100 pictures of each person, yielding this huge dataset.

The dataset was originally built to serve the MSR Image Recognition Challenge (MSR IRC), one of the top image recognition competitions in the world.
MS Celeb 1M is often used for facial recognition training. However, because the images were taken from the Internet, they have drawn questions. Microsoft said the images were scraped under the Creative Commons (CC) license.
Under that license, the photos can be reused for academic research (the people in the photos have not necessarily granted permission, but the copyright owners have). Once Microsoft released the dataset, however, it could not control how the data was used. An in-depth investigation by the Financial Times found that the data had been used extensively in corporate testing.
Companies including IBM, Panasonic, Alibaba, Nvidia, and Hitachi have used the dataset.
This raises questions about the norms governing dataset use. One researcher also pointed out that it touches on the ethics, provenance, and privacy of facial recognition image datasets.
The reason for deletion: did the employee responsible for the dataset leave?
Microsoft quietly took MS Celeb 1M offline without any specific explanation.

In a Financial Times report, Microsoft said "the main purpose of this website is for academic purposes," and that the reason for deleting it was that "the employee who ran the project left and is no longer working with Microsoft, so it was deleted."
Many believe there must be other reasons, possibly problems with the images themselves. Although Microsoft said the dataset consists entirely of photos of public figures, it also includes a small number of non-celebrities, and the owners of those facial photos have questioned and criticized Microsoft for using their names and likenesses.
Some in the industry also speculated that Microsoft deleted the data to avoid being charged with violating the EU's General Data Protection Regulation (GDPR), which came into effect last year and aims to establish data protection safeguards.

Microsoft, however, said the dataset did not fall under the provisions of GDPR and that the related website was retired simply because "the competition was over."
Of course, Microsoft's removal of the MS Celeb dataset does not prevent copies already in circulation from being used in academic research and other channels, and tools for working with the dataset remain accessible.
Commonly used public datasets may also have privacy problems
After the Financial Times investigation, two academic institutions also deleted related datasets: Duke University's Duke MTMC surveillance dataset and Stanford University's Brainwash dataset.
This is not the first time that datasets and privacy issues have drawn public attention. At the end of January this year, IBM released a million-scale "face diversity" dataset billed as unbiased, which caused widespread controversy.
Although IBM emphasized that the move was meant to reduce "bias" in facial recognition, the dataset's sources and the extent to which the people pictured were aware of it raised many doubts.
Some media also reported that IBM said it would delete photos from the dataset at the subjects' request, but this remained a one-sided statement, with no actual action taken.

The rules for collecting and using datasets remain a very unclear area. With the convenience of the Internet, many institutions can easily obtain large numbers of images for purposes such as facial recognition.
In fact, the solution to the privacy issues around datasets can be very simple: when personal information is involved, users' right to know should be guaranteed, and their consent to contribute data should be confirmed.
But what seems to be missing is never the method, but the awareness.