
MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh, Jan Kautz
Abstract

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve a new state-of-the-art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably-sized backbones and demonstrates more favorable performance. Code: https://github.com/NVlabs/MambaVision.
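As a rough illustration of the layer layout the abstract describes (Mamba-based mixers for most of a stage, self-attention blocks only at the final layers), here is a minimal sketch. The helper name `stage_block_types` and the even split shown in the usage example are assumptions for illustration, not the paper's exact recipe; consult the linked repository for the actual architecture.

```python
def stage_block_types(num_blocks: int, num_attention: int) -> list:
    """Hypothetical helper: return the mixer type for each block in a
    stage, placing Mamba mixers first and self-attention blocks at the
    end, mirroring the hybrid layout the abstract describes."""
    if not 0 <= num_attention <= num_blocks:
        raise ValueError("num_attention must be between 0 and num_blocks")
    # Early blocks use the (redesigned) Mamba mixer; the final blocks
    # use self-attention to capture long-range spatial dependencies.
    return (["mamba"] * (num_blocks - num_attention)
            + ["attention"] * num_attention)
```

For example, `stage_block_types(8, 4)` lays out a stage whose first four blocks are Mamba mixers and whose last four are self-attention blocks.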