Exploring Plain Vision Transformer Backbones for Object Detection

Li, Yanghao; Mao, Hanzi; Girshick, Ross; He, Kaiming
Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP^box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
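As an illustration of point (i), the sketch below shows how a multi-scale pyramid might be built from the single stride-16 output map of a plain ViT using only simple convolutions and deconvolutions, rather than a top-down FPN. The class name, channel widths, and layer choices here are assumptions made for illustration; the exact ViTDet configuration lives in the Detectron2 code referenced above.

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Illustrative sketch (not the official ViTDet code): derive feature maps
    at strides 4, 8, 16, and 32 from the single stride-16 feature map produced
    by a plain, non-hierarchical ViT backbone."""

    def __init__(self, dim: int = 768, out_dim: int = 256):
        super().__init__()
        # Stride 1/4: upsample the stride-16 map by 4x with two deconvolutions.
        self.scale_4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_dim, kernel_size=2, stride=2),
        )
        # Stride 1/8: upsample by 2x.
        self.scale_8 = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)
        # Stride 1/16: keep the resolution, only project the channels.
        self.scale_16 = nn.Conv2d(dim, out_dim, kernel_size=1)
        # Stride 1/32: downsample by 2x.
        self.scale_32 = nn.Conv2d(dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor):
        # x: (B, dim, H/16, W/16), the last feature map of the plain ViT.
        return [self.scale_4(x), self.scale_8(x), self.scale_16(x), self.scale_32(x)]


if __name__ == "__main__":
    feat = torch.randn(1, 768, 40, 40)   # e.g. a 640x640 image at stride 16
    pyramid = SimpleFeaturePyramid()(feat)
    print([p.shape for p in pyramid])    # strides 4, 8, 16, 32
```

Because every pyramid level is computed directly from the same single-scale map, no lateral or top-down connections are needed, which is what makes the design compatible with a backbone that has no hierarchical stages.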
