Cross-Modality Fusion Transformer for Multispectral Object Detection

Multispectral image pairs can provide combined information, making object detection applications more reliable and robust in the open world. To fully exploit the different modalities, we present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT), in this paper. Unlike prior CNN-based works, guided by the transformer scheme, our network learns long-range dependencies and integrates global contextual information in the feature extraction stage. More importantly, by leveraging the self-attention of the transformer, the network can naturally carry out simultaneous intra-modality and inter-modality fusion, and robustly capture the latent interactions between the RGB and Thermal domains, thereby significantly improving the performance of multispectral object detection. Extensive experiments and ablation studies on multiple datasets demonstrate that our approach is effective and achieves state-of-the-art detection performance. Our code and models are available at https://github.com/DocF/multispectral-object-detection.
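To make the fusion idea concrete, below is a minimal PyTorch sketch (not the authors' implementation; see the linked repository for that) of how a single self-attention block can fuse two modalities: tokens from the RGB and thermal feature maps are concatenated into one sequence, so each token attends both to tokens of its own modality (intra-modality fusion) and to tokens of the other modality (inter-modality fusion) in a single operation. The module name and layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossModalityFusionBlock(nn.Module):
    """Hypothetical fusion block: shared self-attention over RGB + thermal tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # rgb, thermal: (B, C, H, W) feature maps from two backbone branches.
        b, c, h, w = rgb.shape
        # Flatten each map into a (B, H*W, C) token sequence and concatenate,
        # giving one joint sequence of 2*H*W tokens.
        tokens = torch.cat(
            [rgb.flatten(2).transpose(1, 2), thermal.flatten(2).transpose(1, 2)],
            dim=1,
        )
        x = self.norm(tokens)
        # Self-attention over the joint sequence performs intra- and
        # inter-modality fusion simultaneously.
        fused, _ = self.attn(x, x, x)
        tokens = tokens + fused  # residual connection
        # Split the fused sequence back into per-modality feature maps.
        rgb_out, thermal_out = tokens.split(h * w, dim=1)
        to_map = lambda t: t.transpose(1, 2).reshape(b, c, h, w)
        return to_map(rgb_out), to_map(thermal_out)


if __name__ == "__main__":
    block = CrossModalityFusionBlock(dim=256)
    rgb = torch.randn(2, 256, 16, 16)
    thermal = torch.randn(2, 256, 16, 16)
    f_rgb, f_thermal = block(rgb, thermal)
    print(f_rgb.shape, f_thermal.shape)  # both torch.Size([2, 256, 16, 16])
```

In practice a block like this would be inserted between backbone stages so that the fused maps feed the remainder of the detector; the quadratic cost of attention over 2*H*W tokens is why fusion is typically applied to downsampled feature maps rather than full-resolution inputs.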