HyperAI

Microsoft Is Not the First and MIT Is Not the Last to Permanently Remove a Dataset

5 years ago
神经小兮

The Massachusetts Institute of Technology recently issued a notice permanently removing the famous Tiny Images Dataset, after the dataset was pointed out to contain labels suspected of racism and misogyny.

The Massachusetts Institute of Technology (MIT) recently issued an apology, announcing that the Tiny Images Dataset will be permanently taken offline and calling on the community to stop using it and delete existing copies. Users who already hold the dataset are asked not to pass it on to others.

Over the past year, several well-known datasets released by companies and research institutions have been taken down or permanently retired, including Microsoft's MS Celeb 1M celebrity dataset, Duke University's Duke MTMC surveillance dataset for pedestrian recognition, and Stanford University's Brainwash dataset for head detection.

The Tiny Images Dataset removed this time was created and released by MIT in 2006. As the name suggests, it is a dataset of tiny images.

It contains 79.3 million 32×32-pixel color images, mostly collected from Google Images. The dataset is large: the images, metadata, and descriptors are stored in binary files, and loading them requires a MATLAB toolbox and an index data file.

The entire dataset is nearly 400 GB. Its sheer scale has also made it one of the most popular datasets in computer vision research.
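For a concrete sense of the storage format, here is a minimal Python sketch of reading a single image from such a flat binary file. It is not the official MATLAB loader; the file name tiny_images.bin and the 3,072-bytes-per-image, channel-by-channel, column-major layout are assumptions used only to illustrate the idea.

```python
# Minimal sketch (not the official MATLAB loader) of pulling one image out of
# a flat binary file of 32x32 RGB images. Assumes each image occupies
# 32*32*3 = 3,072 uint8 bytes, stored one color channel at a time in
# column-major order -- verify this layout against the dataset documentation.
import numpy as np

BYTES_PER_IMAGE = 32 * 32 * 3  # 3,072 bytes per image under this assumption

def read_tiny_image(path, index):
    """Return image `index` as a (32, 32, 3) uint8 array."""
    with open(path, "rb") as f:
        f.seek(index * BYTES_PER_IMAGE)
        buf = np.frombuffer(f.read(BYTES_PER_IMAGE), dtype=np.uint8)
    # (channel, column, row) -> (row, column, channel)
    return buf.reshape(3, 32, 32).transpose(2, 1, 0)

# Hypothetical usage:
# img = read_tiny_image("tiny_images.bin", 0)  # first image in the file
```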

The paper published alongside the dataset, "80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition," has been cited as many as 1,718 times.

A paper triggers large-scale self-examination of datasets

The Tiny Images Dataset became a focus of attention because of a recently published paper titled "Large image datasets: A pyrrhic win for computer vision?"

The paper raises pointed questions about whether such large datasets comply with ethical and privacy norms.

Paper address: https://arxiv.org/pdf/2006.16923.pdf

One of the two authors is Vinay Prabhu, chief scientist at UnifyID, a Silicon Valley artificial intelligence startup that provides identity-verification solutions.

The other author is Abeba Birhane, a PhD candidate at University College Dublin.

The paper mainly takes the ImageNet-ILSVRC-2012 dataset as an example. The authors found that it contains a number of covertly taken photographs (for instance, people photographed surreptitiously on beaches, in some cases with private parts exposed). They argue that, owing to lax review, such images seriously violate the privacy of the people depicted.

Once a classic dataset, now politically incorrect

Unlike ImageNet, which is faulted mainly for privacy violations, the Tiny Images Dataset is condemned in the paper because it contains tens of thousands of images carrying racist and misogynistic labels.

The paper also points out that, because the Tiny Images Dataset was never reviewed or curated in any way, its problems of discrimination and privacy violation are even more serious.

A sample of images from the Tiny Images Dataset

The Tiny Images Dataset is labeled according to the WordNet lexicon, classifying its nearly 80 million images into roughly 75,000 categories.

It is precisely some of these WordNet-derived labels that have brought the dataset under scrutiny.

WordNet is to blame, image datasets are also to blame 

As is well known, WordNet was designed jointly by psychologists, linguists, and computer engineers at Princeton University's Cognitive Science Laboratory. Since its release in 1985, it has been the most standardized and comprehensive lexical reference for the English language.

"Standardized and comprehensive" here means that it objectively collects the English words that exist in human society and organizes them with meanings and semantic relations.

In the Tiny Images Dataset, 53,464 different nouns from WordNet are used as image labels.
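To make the relationship between WordNet and the label vocabulary concrete, the sketch below uses NLTK's WordNet interface to enumerate noun lemmas that could serve as such a vocabulary. This is only an illustration of where the label set comes from; the Tiny Images pipeline used its own tooling and a fixed subset of nouns.

```python
# Illustration only: enumerate WordNet noun lemmas as a candidate label
# vocabulary using NLTK (run nltk.download("wordnet") once beforehand).
# The actual Tiny Images pipeline used its own tooling and a fixed subset.
from nltk.corpus import wordnet as wn

noun_labels = sorted({
    lemma.name().replace("_", " ")
    for synset in wn.all_synsets(pos=wn.NOUN)
    for lemma in synset.lemmas()
})
print(len(noun_labels), "candidate noun labels; examples:", noun_labels[:5])
```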

Statistics of sensitive words related to race and gender in the dataset

Precisely because of this, directly importing expressions that exist in human society inevitably brings in words that carry racial discrimination and sexism.

For example, clearly insulting or derogatory words such as Bi*ch, Wh*re, and Ni*ger have become labels attached to images. There are also subjective, accusatory terms such as "molester" and "pedophile."

Before scientific research, we need to weigh the social impact

The authors argue that many large-scale image datasets were built without careful consideration of their social impact and may threaten or harm individual rights.

Because the data is openly available, anyone can use an open API to run queries that identify or profile the people appearing in ImageNet or other datasets. This is genuinely dangerous and an infringement on those involved. The authors also propose three remedies (a minimal code sketch of the second follows this list):
First, synthetic reality and dataset distillation, for example using (or augmenting with) synthetic images instead of real images during model training;
Second, stronger ethical filtering of datasets;
Third, quantitative dataset auditing: the authors conducted a cross-category quantitative analysis of ImageNet to assess the extent of ethical violations and to gauge the feasibility of model-annotation-based approaches.
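As an illustration of the second remedy, the sketch below filters a label vocabulary against a manually curated blocklist, keeping neutral labels and flagging the rest for human review. The blocklist contents and the filter_labels helper are hypothetical; the paper does not prescribe a particular implementation.

```python
# Hypothetical sketch of label-level "ethical filtering": split a dataset's
# label vocabulary into labels to keep and labels flagged for human review.
# OFFENSIVE_TERMS is a placeholder blocklist, not taken from the paper.
OFFENSIVE_TERMS = {"offensive_term_1", "offensive_term_2"}

def filter_labels(labels):
    """Return (kept, flagged) lists based on the blocklist."""
    kept, flagged = [], []
    for label in labels:
        (flagged if label.lower() in OFFENSIVE_TERMS else kept).append(label)
    return kept, flagged

kept, flagged = filter_labels(["bicycle", "offensive_term_1", "plant"])
print("kept:", kept)
print("flagged for review:", flagged)
```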

Dataset removals: some voluntary, some under external pressure

MIT is not the first to take down a dataset, whether out of self-awareness or under public pressure. Microsoft removed the famous MS Celeb 1M dataset as early as mid-2019 and announced that it would no longer be used.

The MS Celeb 1M dataset was built by identifying one million celebrities on the Internet, selecting 100,000 of them by popularity, and then using a search engine to collect roughly 100 pictures of each person.

MS Celeb 1M dataset

MS Celeb 1M is widely used for face recognition training. It was first used in the MSR IRC competition, one of the highest-level image recognition competitions in the world, and companies including IBM, Panasonic, Alibaba, Nvidia, and Hitachi have also used it.

A researcher pointed out that the dataset raises questions about the ethics, provenance, and personal privacy of face recognition image data. The images were scraped from the Internet; Microsoft said they were collected under Creative Commons (CC) licenses, but such a license is granted by the copyright holder of a photo, not necessarily by the people who appear in it.

Under those licenses, the photos may be used for academic research, but once Microsoft released the dataset it could no longer effectively supervise how the data was used.

Besides MS Celeb 1M, Duke University's Duke MTMC surveillance dataset for pedestrian recognition and Stanford University's Brainwash dataset for head detection have likewise been taken down.

Download other datasets while you can; they may be taken down tomorrow

The recent Black Lives Matter movement for racial equality has sent shockwaves through every sector in Europe and the United States, and the computer science and engineering communities have likewise been discussing, arguing, and reflecting.

Early on, companies and projects such as GitHub and the Go language began revising naming conventions: the terms "blacklist" and "whitelist" are to be avoided in favor of the neutral "blocklist" and "allowlist," and the default branch name is to be changed from "master" to "trunk."

Deep learning pioneer Yann LeCun, accused of making racist and sexist remarks, voluntarily quit Twitter.

Now the push for political correctness may be turning toward large datasets.

Admittedly, many datasets were designed without sufficient forethought and have real imperfections. But under current conditions, simply taking the relevant datasets offline is not the best way to address bias.

After all, these images do not exist only in these datasets, and the biases go beyond a few words in WordNet.

Even if a dataset is removed, the images remain all over the Internet; even if WordNet is disabled, the words remain in people's minds. To resolve bias in AI, we still have to confront the long-standing biases in society and culture.

LeCun: just a few tweets and I'm done (shrugs)

-- over--