Attention-Based Multimodal Image Matching

We propose an attention-based approach for multimodal image patch matching, using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture that adds a residual connection bypassing the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single-modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
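To make the described architecture concrete, the following is a minimal sketch of the attention-residual idea: a shared (Siamese) CNN yields one embedding per scale, a Transformer encoder aggregates these multiscale tokens, and a residual connection lets the raw tokens bypass the encoder. All layer sizes, the three-stage trunk, the global-average-pooled tokens, and the cosine-similarity matching head are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AttentionResidualMatcher(nn.Module):
    """Hedged sketch (assumed: grayscale 64x64 patches, d=128 tokens, 3 scales)."""

    def __init__(self, d=128, heads=4, layers=2):
        super().__init__()
        # Shared (Siamese) CNN trunk producing feature maps at three scales.
        self.stage1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Project each scale's pooled feature map to a common token width d.
        self.proj = nn.ModuleList([nn.Linear(c, d) for c in (32, 64, 128)])
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def embed(self, x):
        # One global token per scale (assumption: global average pooling).
        tokens = []
        for stage, proj in zip((self.stage1, self.stage2, self.stage3), self.proj):
            x = stage(x)
            tokens.append(proj(x.mean(dim=(2, 3))))  # (B, d) token per scale
        t = torch.stack(tokens, dim=1)                # (B, scales, d)
        # Attention-residual: the multiscale tokens bypass the encoder.
        return (self.encoder(t) + t).mean(dim=1)      # (B, d) patch descriptor

    def forward(self, a, b):
        # Siamese matching score between two patches (assumed head).
        return nn.functional.cosine_similarity(self.embed(a), self.embed(b))

# Usage: score a batch of cross-modal patch pairs.
model = AttentionResidualMatcher()
a = torch.randn(8, 1, 64, 64)  # e.g., visible-band patches
b = torch.randn(8, 1, 64, 64)  # e.g., infrared patches
print(model(a, b).shape)        # torch.Size([8])
```

The residual add after the encoder mirrors the bypass connection the abstract credits with easing end-to-end training from scratch: gradients can flow to the CNN trunk directly, without passing through the Transformer layers.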