EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota
Published: 6/4/2025
Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.
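To make the Cross-modal Fusion idea concrete, below is a minimal PyTorch sketch of density-weighted token fusion: two sensor streams (e.g., optical and SAR) are projected into a shared embedding space, a small gating network scores each token's information density, and tokens are reweighted before being passed to the LLM. The class and parameter names, the linear gate, and the softmax-based reweighting are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch of density-based token reweighting for multi-sensor fusion.

    All names and the gating mechanism are hypothetical; the paper only
    states that modalities are aligned into a shared space and tokens are
    adaptively reweighted by information density.
    """

    def __init__(self, dim_opt: int, dim_sar: int, dim_shared: int):
        super().__init__()
        self.proj_opt = nn.Linear(dim_opt, dim_shared)  # align optical tokens
        self.proj_sar = nn.Linear(dim_sar, dim_shared)  # align SAR tokens
        self.gate = nn.Linear(dim_shared, 1)            # per-token density score

    def forward(self, opt_tokens: torch.Tensor, sar_tokens: torch.Tensor) -> torch.Tensor:
        # Map heterogeneous modalities into one shared embedding space.
        opt = self.proj_opt(opt_tokens)       # (B, N_opt, D)
        sar = self.proj_sar(sar_tokens)       # (B, N_sar, D)
        fused = torch.cat([opt, sar], dim=1)  # (B, N_opt + N_sar, D)
        # Adaptive reweighting: softmax over per-token density scores,
        # scaled by the token count so the weights average to 1.
        weights = F.softmax(self.gate(fused), dim=1)  # (B, N, 1)
        return fused * weights * fused.size(1)

# Usage: fuse 196 optical and 196 SAR tokens into one sequence for the LLM.
fusion = CrossModalFusion(dim_opt=1024, dim_sar=768, dim_shared=512)
tokens = fusion(torch.randn(2, 196, 1024), torch.randn(2, 196, 768))
print(tokens.shape)  # torch.Size([2, 392, 512])
```

The key design point this sketch illustrates is that reweighting happens per token rather than per modality, so an uninformative sensor (e.g., cloudy optical imagery) can be down-weighted locally without discarding the whole stream.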