8 months ago

Abstract

Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. The essence of multimodal fusion lies in mining the complementary information between different modalities. However, general fusion methods rely on traditional neural architectures and are not well equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes a novel coarse-to-fine and end-to-end connected cross-modal place recognition framework, called MambaPlace. In the coarse localization stage, the text description and 3D point cloud are encoded by the pretrained T5 and instance encoder, respectively. They are then processed using Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for data enhancement and alignment. In the subsequent fine localization stage, the features of the text description and 3D point cloud are cross-modally fused and further enhanced through cascaded Cross Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text-point cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to the state-of-the-art methods.
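
The coarse-to-fine flow described above can be illustrated with a minimal Python sketch. This is not the MambaPlace implementation: the learned components (T5, the instance encoder, TAM, PCM, CCAM) are replaced by hand-written toy vectors, the coarse stage is assumed to be a cosine-similarity ranking of candidate cells, and the fine-stage offset is a fixed stand-in value. All names (`coarse_retrieve`, `fine_localize`, `cell_embs`, `cell_centers`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_retrieve(text_emb, cell_embs, k=3):
    """Coarse stage (assumed): return ids of the k cells most similar to the text."""
    ranked = sorted(
        cell_embs,
        key=lambda cell_id: cosine(text_emb, cell_embs[cell_id]),
        reverse=True,
    )
    return ranked[:k]


def fine_localize(cell_center, offset):
    """Fine stage (assumed): add a predicted 2D offset to the retrieved cell centre."""
    return (cell_center[0] + offset[0], cell_center[1] + offset[1])


# Toy embeddings standing in for encoder outputs (hypothetical values).
cell_embs = {"cell_a": [1.0, 0.0], "cell_b": [0.6, 0.8], "cell_c": [0.0, 1.0]}
cell_centers = {"cell_a": (0.0, 0.0), "cell_b": (30.0, 0.0), "cell_c": (60.0, 0.0)}
text_emb = [0.5, 0.9]

top = coarse_retrieve(text_emb, cell_embs, k=2)

# Stand-in for the offset a fine-stage regressor would predict from fused features.
predicted_offset = (2.5, -1.0)
position = fine_localize(cell_centers[top[0]], predicted_offset)
print(top, position)
```

The split mirrors the two stages in the abstract: retrieval narrows the search to a few candidate cells, and the offset refines the estimate within the best one.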
