
Simple Open-Vocabulary Object Detection with Vision Transformers

Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
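The core mechanism behind zero-shot text-conditioned detection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes we already have region image embeddings from the detector and text embeddings from the contrastively pre-trained text encoder, and assigns each region the class name whose text embedding is most similar in the shared embedding space. The function name and the toy embeddings are hypothetical.

```python
import numpy as np

def classify_regions(region_embeds, text_embeds):
    """Assign each detected region the index of the closest text query.

    region_embeds: (num_regions, dim) image embeddings of detected regions
    text_embeds:   (num_queries, dim) embeddings of free-text class names
    """
    # L2-normalize so dot products become cosine similarities
    r = region_embeds / np.linalg.norm(region_embeds, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    scores = r @ t.T  # (num_regions, num_queries) similarity matrix
    return scores.argmax(axis=-1), scores

# Toy example with made-up 2-D embeddings for two regions and two
# text queries (e.g. "cat", "dog"); real embeddings are high-dimensional.
regions = np.array([[1.0, 0.1], [0.0, 1.0]])
queries = np.array([[1.0, 0.0], [0.1, 1.0]])
labels, scores = classify_regions(regions, queries)
# Each region is matched to the nearest text query.
```

Because the class set is given only as text at inference time, new categories can be queried without retraining, which is what makes the detector open-vocabulary.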

