
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Yifeng Shi†, Xin Hao∗, Feng Lv∗, Xinliang Wang∗, Chunlong Xia∗
Abstract

Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive-field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training schemes. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.
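The abstract describes two architectural ideas: a spatial pyramid of multi-receptive-field convolutions, and a bidirectional fusion module that exchanges information between CNN features and ViT tokens. The PyTorch sketch below illustrates one plausible way such components could be wired together; all class names, kernel sizes, and the cross-attention formulation are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class MultiReceptiveFieldConv(nn.Module):
    """Hypothetical spatial pyramid: parallel depthwise convolutions with
    different kernel sizes provide multiple receptive fields, then a 1x1
    convolution fuses the branches back to the original channel width."""

    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(dim * len(kernel_sizes), dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))


class BidirectionalFusion(nn.Module):
    """Hypothetical bidirectional CNN-Transformer interaction via
    cross-attention: ViT tokens query CNN features for local detail,
    then CNN features query the updated tokens for global context."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_c = nn.LayerNorm(dim)

    def forward(self, vit_tokens, cnn_tokens):  # both: (B, N, C)
        # Inject multi-scale convolutional features into the ViT branch.
        v = vit_tokens + self.cnn_to_vit(
            self.norm_v(vit_tokens), cnn_tokens, cnn_tokens)[0]
        # Feed global transformer context back into the CNN branch.
        c = cnn_tokens + self.vit_to_cnn(
            self.norm_c(cnn_tokens), v, v)[0]
        return v, c


# Toy usage: fuse a 32x32 CNN feature map with the matching ViT tokens.
B, C, H, W = 2, 256, 32, 32
vit_tokens = torch.randn(B, H * W, C)
cnn_feat = MultiReceptiveFieldConv(C)(torch.randn(B, C, H, W))
cnn_tokens = cnn_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
v, c = BidirectionalFusion(C)(vit_tokens, cnn_tokens)
```

In this sketch the interaction is symmetric and residual, so either branch can be stacked repeatedly without changing tensor shapes; the paper's actual module additionally operates across hierarchical feature levels, which is omitted here for brevity.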