MIT takes down 80 Million Tiny Images data set due to racist and offensive content
Creators of the 80 Million Tiny Images data set from MIT and NYU took the collection offline this week, apologized, and asked other researchers to refrain from using the data set and delete any existing copies. The news was shared Monday in a letter by MIT professors Bill Freeman and Antonio Torralba and NYU professor Rob Fergus published on the MIT CSAIL website.
Introduced in 2006 and containing photos scraped from internet search engines, 80 Million Tiny Images was recently found to contain a range of racist, sexist, and otherwise offensive labels such as nearly 2,000 images labeled with the N-word, and labels like “rape suspect” and “child molester.” The data set also contained pornographic content like non-consensual photos taken up women’s skirts. Creators of the 79.3 million-image data set said it was too large and its 32 x 32 images too small, making visual inspection of the data set’s complete contents difficult. According to Google Scholar, 80 Million Tiny Images has been cited more 1,700 times.
“Biases, offensive and prejudicial images, and derogatory terminology alienates an important part of our community — precisely those that we are making efforts to include,” the professors wrote in a joint letter. “It also contributes to harmful biases in AI systems trained on such data. Additionally, the presence of such prejudicial images hurts efforts to foster a culture of inclusivity in the computer vision community. This is extremely unfortunate and runs counter to the values that we strive to uphold.”
The trio of professors say the data set’s shortcoming were brought to their attention by an analysis and audit published late last month (PDF) by University of Dublin Ph.D. student Abeba Birhane and Carnegie Mellon University Ph.D. student Vinay Prabhu. The authors say their assessment is the first known critique of 80 Million Tiny Images.
VB Transform 2020 Online – July 15-17. Join leading AI executives: Register for the free livestream.
Both the paper authors and the 80 Million Tiny Images creators say part of the problem comes from automated data collection and nouns from the WordNet data set for semantic hierarchy. Before the data set was taken offline, the coauthors suggested the creators of 80 Million Tiny Images do like ImageNet creators did and assess labels used in the people category of the data set. The paper finds that large-scale image data sets erode privacy and can have a disproportionately negative impact on women, racial and ethnic minorities, and communities at the margin of society.
Birhane and Prabhu assert that the computer vision community must begin having more conversations about the ethical use of large-scale image data sets now in part due to the growing availability of image-scraping tools and reverse image search technology. Citing previous work like the Excavating AI analysis of ImageNet, the analysis of large-scale image data sets shows that it’s not just a matter of data, but a matter of a culture in academia and industry that finds it acceptable to create large-scale data sets without the consent of participants “under the guise of anonymization.”
“[W]e posit that the deeper problems are rooted in the wider structural traditions, incentives, and discourse of a field that treats ethical issues as an afterthought. A field where in the wild is often a euphemism for without consent. We are up against a system that has veritably mastered ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping, and ethics shirking,” the paper states.
To create more ethical large-scale image data sets, Birhane and Prabhu suggest:
Blur the faces of people in data sets
Do not use Creative Commons licensed material
Collect imagery with clear consent from data set participants
Include a data set audit card with large-scale image data sets, akin to the model cards Google AI uses and the datasheets for data sets Microsoft Research proposed
The work incorporates Birhane’s previous work on relational ethics, which suggests that the creators of machine learning systems should begin their work by speaking with the people most affected by machine learning systems, and that concepts of bias, fairness, and justice are moving targets.
ImageNet was introduced at CVPR in 2009 and is widely considered important to the advancement of computer vision and machine learning. Whereas previously some of the largest data sets could be counted in the tens of thousands, ImageNet contains more than 14 million images. The ImageNet Large Scale Visual Recognition Challenge ran from 2010 to 2017 and led to the launch of a variety of startups like Clarifai and MetaMind, a company Salesforce acquired in 2017. According to Google Scholar, ImageNet has been cited nearly 17,000 times.
As part of a series of changes detailed in December 2019, ImageNet creators including lead author Jia Deng and Dr. Fei-Fei Li found that 1,593 of the 2,832 people categories in the data set potentially contain offensive labels, which they said they plan to remove.
“We indeed celebrate ImageNet’s achievement and recognize the creators’ efforts to grapple with some ethical questions. Nonetheless, ImageNet as well as other large image datasets remain troublesome,” the Birhane and Prabhu paper reads.