Detecting Twenty-thousand Classes using Image-level Supervision

Zhou, Xingyi ; Girdhar, Rohit ; Joulin, Armand ; Krähenbühl, Philipp ; Misra, Ishan

Abstract

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not need complex assignment schemes to assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP when evaluated on all classes, or only rare classes, hence closing the gap in performance for object categories with few samples. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning. Code is available at https://github.com/facebookresearch/Detic.
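
The abstract does not spell out how an image-level label reaches the detector's classifier without a prediction-based assignment. As a minimal sketch of one prediction-independent option, assumed here for illustration, the image label can be applied to the largest region proposal and scored with a per-class binary cross-entropy. The function names, box format `(x1, y1, x2, y2)`, and toy inputs below are hypothetical and are not taken from the released code.

```python
import math

def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed stably from logits."""
    total = 0.0
    for z, t in zip(logits, targets):
        # max(z, 0) - z*t + log(1 + exp(-|z|)) is the stable form of BCE.
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def image_label_loss(proposals, proposal_logits, image_label, num_classes):
    """Classification loss for an image that has a class label but no boxes.

    Assumption for this sketch: the label is attached to the largest
    proposal, so the choice never depends on the model's own predictions.
    """
    largest = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    targets = [1.0 if c == image_label else 0.0 for c in range(num_classes)]
    return largest, bce_with_logits(proposal_logits[largest], targets)

# Toy example: three proposals, three classes, image labeled as class 1.
proposals = [(10, 10, 50, 50), (0, 0, 200, 150), (30, 40, 90, 120)]
logits = [[2.0, -1.0, 0.5], [0.0, 0.0, 0.0], [-0.5, 1.5, 0.2]]
index, loss = image_label_loss(proposals, logits, image_label=1, num_classes=3)
print(index, round(loss, 4))  # 1 0.6931
```

Because the proposal is chosen by box size alone, the supervision signal stays the same however poorly the classifier currently scores the labeled class, which is the property the abstract contrasts with prediction-based assignment schemes.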