A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization

Cross-view geo-localization is the task of matching images of the same geographic location taken from different views, e.g., unmanned aerial vehicle (UAV) and satellite. The main challenges are position shift and the uncertainty of distance and scale. Existing methods mainly aim to mine more comprehensive fine-grained information, but they underestimate the importance of extracting robust feature representations and the impact of feature alignment. CNN-based methods have achieved great success in cross-view geo-localization, yet they have limitations: they can only extract part of the information in a local neighborhood, and some scale-reduction operations discard fine-grained information. To address these limitations, we introduce a simple and efficient transformer-based structure called Feature Segmentation and Region Alignment (FSRA), which enhances the model's ability to understand contextual information and the distribution of instances. Without using additional supervisory information, FSRA divides regions based on the heat distribution of the transformer's feature map and then aligns multiple specific regions in different views one to one. Finally, FSRA integrates the regions into a set of feature representations. The key difference is that FSRA divides regions automatically, based on the heat distribution of the feature map, rather than manually, so specific instances can still be divided and aligned when the image exhibits significant shifts and scale changes. In addition, a multiple sampling strategy is proposed to overcome the disparity between the number of satellite images and the number of images from other sources. Experiments show that the proposed method achieves state-of-the-art performance on both drone-view target localization and drone navigation. Code will be released at https://github.com/Dmmm1997/FSRA
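
The idea of dividing regions by heat rather than by position can be illustrated with a minimal sketch. It is not the released FSRA code, and it rests on assumptions the abstract does not spell out: the heat of a patch token is taken to be the mean of its feature values, tokens are ranked by heat and split into equally sized groups, and each group is average-pooled into one region feature. The function names `segment_by_heat` and `pool_regions` and the toy token values are hypothetical.

```python
from typing import List, Sequence


def segment_by_heat(tokens: Sequence[Sequence[float]], num_regions: int) -> List[List[int]]:
    """Group token indices into `num_regions` nearly equal bins, hottest first."""
    if not 1 <= num_regions <= len(tokens):
        raise ValueError("num_regions must be between 1 and the number of tokens")
    # Assumption: the heat of a token is the mean of its feature values.
    heat = [sum(t) / len(t) for t in tokens]
    # Rank token indices from hottest to coldest.
    order = sorted(range(len(tokens)), key=lambda i: heat[i], reverse=True)
    # Cut the ranking into contiguous chunks; the first `extra` chunks get one more token.
    base, extra = divmod(len(order), num_regions)
    regions, start = [], 0
    for r in range(num_regions):
        size = base + (1 if r < extra else 0)
        regions.append(order[start:start + size])
        start += size
    return regions


def pool_regions(tokens: Sequence[Sequence[float]], regions: List[List[int]]) -> List[List[float]]:
    """Average the token features inside each region: one vector per region."""
    dim = len(tokens[0])
    return [
        [sum(tokens[i][d] for i in region) / len(region) for d in range(dim)]
        for region in regions
    ]


# Toy example: the same four patches appear at shifted positions in the two views.
drone_tokens = [[4.0, 6.0], [1.0, 1.0], [2.0, 4.0], [0.0, 0.0]]
satellite_tokens = [[1.0, 1.0], [0.0, 0.0], [4.0, 6.0], [2.0, 4.0]]

drone_regions = pool_regions(drone_tokens, segment_by_heat(drone_tokens, 2))
satellite_regions = pool_regions(satellite_tokens, segment_by_heat(satellite_tokens, 2))
```

Because the regions are defined by heat rank rather than by grid position, region k of the drone view and region k of the satellite view pool the same patches in this toy case even though the patches moved, which is what allows a one-to-one comparison of region features under position shift.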