Vision-Language Geo-Foundation Model (VLGFM)
A Vision-Language Geo-Foundation Model (VLGFM) is an artificial intelligence model designed specifically for processing and analyzing Earth observation data. It combines visual and language information to improve the understanding and analysis of geospatial data. VLGFMs can perform a variety of multimodal tasks, including image captioning, image-text retrieval, visual question answering, and visual grounding.
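To make the task list concrete, the following minimal sketch shows what image captioning and visual question answering on a remote-sensing image can look like in code. It is illustrative only: it uses BLIP, a general-purpose vision-language model available through the Hugging Face transformers library, rather than an actual VLGFM, and the image file name and question are placeholders. A real VLGFM would be queried through the same kind of interface but trained on Earth observation data.

from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)

# Placeholder remote-sensing image (hypothetical file name).
image = Image.open("airport_scene.png").convert("RGB")

# Image captioning with a general-purpose vision-language model (not a VLGFM).
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
cap_inputs = cap_processor(images=image, return_tensors="pt")
cap_ids = cap_model.generate(**cap_inputs, max_new_tokens=30)
print("Caption:", cap_processor.decode(cap_ids[0], skip_special_tokens=True))

# Visual question answering: ask a free-form question about the same image.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_inputs = vqa_processor(images=image, text="How many airplanes are visible?", return_tensors="pt")
vqa_ids = vqa_model.generate(**vqa_inputs)
print("Answer:", vqa_processor.decode(vqa_ids[0], skip_special_tokens=True))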
The concept of VLGFM was first described in the paper "Towards Vision-Language Geo-Foundation Model: A Survey", a review jointly completed by researchers from Nanyang Technological University, SenseTime, Shanghai AI Lab, and Shanghai Jiao Tong University and published in 2024. It is the first literature review on VLGFMs. The paper discusses the differences between VLGFMs, vision geo-foundation models, and general-purpose vision-language models, and summarizes the model architectures and commonly used datasets of existing VLGFMs.