The 2021 Image Similarity Dataset and Challenge

This paper introduces a new benchmark for large-scale image similarity detection. This benchmark is used for the Image Similarity Challenge at NeurIPS'21 (ISC2021). The goal is to determine whether a query image is a modified copy of any image in a reference corpus of size 1 million. The benchmark features a variety of image transformations such as automated transformations, hand-crafted image edits and machine-learning-based manipulations. This mimics real-life cases appearing in social media, for example for integrity-related problems dealing with misinformation and objectionable content. The strength of the image manipulations, and therefore the difficulty of the benchmark, is calibrated according to the performance of a set of baseline approaches. Both the query and reference set contain a majority of "distractor" images that do not match, which corresponds to a real-life needle-in-haystack setting, and the evaluation metric reflects that. We expect the DISC21 benchmark to promote image copy detection as an important and challenging computer vision task and refresh the state of the art. Code and data are available at https://github.com/facebookresearch/isc2021
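
To make the task setup concrete, the following is a minimal sketch of copy detection in a needle-in-haystack setting. Everything in it is an assumption for illustration: the toy three-dimensional descriptors, the use of cosine similarity, the threshold value, and the function names are hypothetical and are not taken from the benchmark's code. The ranking-based score shown is a globally ranked average precision, one kind of metric that penalizes confident false matches on distractors; whether it matches the challenge's official metric should be checked against the paper and repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two descriptor vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query_vec, references):
    """Return (reference_id, score) for the most similar reference image."""
    scored = ((rid, cosine(query_vec, rvec)) for rid, rvec in references.items())
    return max(scored, key=lambda pair: pair[1])


def detect_copies(queries, references, threshold):
    """Map each query id to (reference_id, score), or None if judged a distractor."""
    predictions = {}
    for qid, qvec in queries.items():
        rid, score = best_match(qvec, references)
        predictions[qid] = (rid, score) if score >= threshold else None
    return predictions


def global_average_precision(scored_pairs, ground_truth):
    """Average precision over all (query, reference, score) triples ranked jointly.

    ground_truth is a set of (query_id, reference_id) pairs that truly match.
    Because all predictions share one ranking, a high-scoring false match on a
    distractor lowers the precision of every true match ranked below it.
    """
    ranked = sorted(scored_pairs, key=lambda t: t[2], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (qid, rid, _score) in enumerate(ranked, start=1):
        if (qid, rid) in ground_truth:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(ground_truth) if ground_truth else 0.0


# Hypothetical toy data: three reference images, one true copy, one distractor.
references = {"r1": [1.0, 0.0, 0.0], "r2": [0.0, 1.0, 0.0], "r3": [0.0, 0.0, 1.0]}
queries = {"q_copy": [0.9, 0.1, 0.0], "q_distractor": [0.5, 0.5, 0.7]}
ground_truth = {("q_copy", "r1")}

predictions = detect_copies(queries, references, threshold=0.9)

# Score every query's best match, without thresholding, for the ranking metric.
scored_pairs = [(qid, *best_match(qvec, references)) for qid, qvec in queries.items()]
score_clean = global_average_precision(scored_pairs, ground_truth)

# Same true match, but a distractor is now scored more confidently than it.
scored_bad = [("q_copy", "r1", 0.95), ("q_distractor", "r3", 0.99)]
score_bad = global_average_precision(scored_bad, ground_truth)
```

In this toy run the edited copy is matched to `r1`, the distractor query is rejected by the threshold, and the ranking score drops when the distractor's false match outranks the true one. At the scale described above, a brute-force loop like this would typically be replaced by an approximate nearest-neighbor index.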