ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection

Effective feature fusion of multispectral images plays a crucial role in multispectral object detection. Previous studies have demonstrated the effectiveness of feature fusion using convolutional neural networks, but these methods are sensitive to image misalignment due to their inherent deficiency in local-range feature interaction, resulting in performance degradation. To address this issue, a novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction and simultaneously capture complementary information across modalities. This framework enhances the discriminability of object features through a query-guided cross-attention mechanism, leading to improved performance. However, stacking multiple transformer blocks for feature enhancement incurs a large number of parameters and high spatial complexity. To handle this, inspired by the human process of reviewing knowledge, an iterative interaction mechanism is proposed that shares parameters among block-wise multimodal transformers, reducing model complexity and computation cost. The proposed method is general and can be integrated into different detection frameworks and used with different backbones. Experimental results on the KAIST, FLIR, and VEDAI datasets show that the proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios. Code will be available at https://github.com/chanchanchan97/ICAFusion.
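To make the two core ideas concrete, the sketch below illustrates (a) dual cross-attention, where each modality's features serve as queries against the other modality's keys and values, and (b) the iterative interaction mechanism, where one set of projection weights is reused across all iterations instead of stacking independently parameterized blocks. This is a minimal single-head NumPy illustration under assumed shapes and names (`CrossAttentionFusion`, `fuse`, flattened token features), not the authors' exact architecture, which the paper only outlines at this level of detail.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class CrossAttentionFusion:
    """Illustrative dual cross-attention fusion with shared weights.

    A single set of Q/K/V projections is created once and reused on
    every iteration, mimicking the parameter-sharing idea; a stacked
    design would instead allocate fresh weights per block.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Shared projection weights, reused across all iterations.
        self.wq = rng.standard_normal((dim, dim)) * scale
        self.wk = rng.standard_normal((dim, dim)) * scale
        self.wv = rng.standard_normal((dim, dim)) * scale
        self.dim = dim

    def attend(self, query_feat, kv_feat):
        """Query-guided cross-attention: one modality queries the other."""
        q = query_feat @ self.wq
        k = kv_feat @ self.wk
        v = kv_feat @ self.wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        return query_feat + attn @ v  # residual keeps original features

    def fuse(self, rgb, thermal, n_iters=3):
        """Iteratively refine both modalities, then concatenate."""
        for _ in range(n_iters):
            # Dual direction: RGB attends to thermal and vice versa,
            # computed from the same pre-update features.
            rgb, thermal = self.attend(rgb, thermal), self.attend(thermal, rgb)
        return np.concatenate([rgb, thermal], axis=-1)


# Usage: fuse 5 spatial tokens of 16-dim features from each modality.
rng = np.random.default_rng(1)
rgb = rng.standard_normal((5, 16))
thermal = rng.standard_normal((5, 16))
fused = CrossAttentionFusion(dim=16).fuse(rgb, thermal)
print(fused.shape)  # (5, 32)
```

Because `fuse` calls the same `attend` weights on every iteration, the parameter count stays constant as `n_iters` grows, which is the complexity saving the abstract attributes to the iterative interaction mechanism.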