
Simple Open-Vocabulary Object Detection with Vision Transformers

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
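The zero-shot, text-conditioned detection described above can be tried with publicly released OWL-ViT checkpoints. The sketch below uses the Hugging Face `transformers` wrappers (`OwlViTProcessor`, `OwlViTForObjectDetection`) rather than the authors' original GitHub release; the checkpoint name, image path, text queries, and score threshold are illustrative assumptions.

```python
# Minimal sketch: zero-shot detection with free-form text queries,
# using an OWL-ViT checkpoint via the Hugging Face `transformers` API.
# The model name, image path, queries, and threshold are assumptions,
# not the authors' reference setup.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")          # any RGB image
queries = [["a cat", "a remote control"]]  # open-vocabulary classes, one list per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into (score, label, box) predictions in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```

Because the class set is given purely as text at inference time, changing the detected categories only requires editing the `queries` list; no retraining is involved.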
