Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li Hanzi Mao Ross Girshick† Kaiming He‡

Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
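
The two observations in the abstract can be illustrated with a minimal NumPy sketch. It is not the ViTDet implementation from Detectron2: the function names (`window_partition`, `window_unpartition`, `simple_feature_pyramid`) are hypothetical, and nearest-neighbor upsampling and average pooling are assumed here only as stand-ins for whatever learned layers a real detector would use. The sketch shows (i) deriving several resolutions from one single-scale feature map, with no top-down or lateral FPN connections, and (ii) splitting a feature map into fixed, non-shifted windows within which attention would be computed.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) map into non-overlapping, non-shifted windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    """
    H, W, C = x.shape
    if H % window_size or W % window_size:
        raise ValueError("H and W must be divisible by window_size")
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes together, then flatten them.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)


def window_unpartition(windows, H, W):
    """Inverse of window_partition: reassemble the (H, W, C) map."""
    ws, C = windows.shape[1], windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)


def simple_feature_pyramid(x, scales=(2.0, 1.0, 0.5)):
    """Build multi-scale maps from a single-scale (H, W, C) map.

    Assumption: nearest-neighbor upsampling and average pooling replace
    learned resampling layers, purely for illustration.
    """
    H, W, C = x.shape
    levels = []
    for s in scales:
        if s >= 1:
            k = int(s)
            levels.append(x.repeat(k, axis=0).repeat(k, axis=1))
        else:
            k = int(round(1 / s))
            levels.append(x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3)))
    return levels


# Usage: an 8x8 map with 3 channels, 4x4 windows, three pyramid levels.
feat = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
wins = window_partition(feat, 4)
pyramid = simple_feature_pyramid(feat)
```

Each level of the pyramid is computed directly from the same final feature map, which is the sense in which the pyramid is "simple". The windows are a fixed grid, so information crosses window boundaries only in the few blocks that the abstract calls cross-window propagation blocks, which this sketch does not model.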

