Exploring Plain Vision Transformer Backbones for Object Detection

Li, Yanghao; Mao, Hanzi; Girshick, Ross; He, Kaiming
Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP^box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
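As an illustration of point (i), the sketch below shows how a multi-scale pyramid might be built from the single stride-16 output map of a plain ViT using only simple convolutions and deconvolutions, rather than a top-down FPN. The class name, channel widths, and layer choices here are assumptions made for illustration; the exact ViTDet configuration lives in the Detectron2 code referenced above.

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Illustrative sketch (not the official ViTDet code): derive feature maps
    at strides 4, 8, 16, and 32 from the single stride-16 feature map produced
    by a plain, non-hierarchical ViT backbone."""

    def __init__(self, dim: int = 768, out_dim: int = 256):
        super().__init__()
        # Stride 1/4: upsample the stride-16 map by 4x with two deconvolutions.
        self.scale_4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_dim, kernel_size=2, stride=2),
        )
        # Stride 1/8: upsample by 2x.
        self.scale_8 = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)
        # Stride 1/16: keep the resolution, only project the channels.
        self.scale_16 = nn.Conv2d(dim, out_dim, kernel_size=1)
        # Stride 1/32: downsample by 2x.
        self.scale_32 = nn.Conv2d(dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor):
        # x: (B, dim, H/16, W/16), the last feature map of the plain ViT.
        return [self.scale_4(x), self.scale_8(x), self.scale_16(x), self.scale_32(x)]


if __name__ == "__main__":
    feat = torch.randn(1, 768, 40, 40)   # e.g. a 640x640 image at stride 16
    pyramid = SimpleFeaturePyramid()(feat)
    print([p.shape for p in pyramid])    # strides 4, 8, 16, 32
```

Because every pyramid level is computed directly from the same single-scale map, no lateral or top-down connections are needed, which is what makes the design compatible with a backbone that has no hierarchical stages.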
