EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota
Published: 6/4/2025
Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space and adaptively reweighs tokens based on their information density for effective fusion. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs, covering a wide range of perception and reasoning tasks. Extensive experiments demonstrate the effectiveness of EarthMind. It achieves state-of-the-art performance on EarthMind-Bench, surpassing GPT-4o despite being only 4B in scale. Moreover, EarthMind outperforms existing methods on multiple public EO benchmarks, showcasing its potential to handle both multi-granular and multi-sensor challenges in a unified framework.
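To make the Cross-modal Fusion idea concrete, below is a minimal PyTorch sketch of density-weighted token fusion: two sensor streams (e.g., optical and SAR) are projected into a shared embedding space, a small gating network scores each token's information density, and tokens are reweighted before being passed to the LLM. The class and parameter names, the linear gate, and the softmax-based reweighting are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch of density-based token reweighting for multi-sensor fusion.

    All names and the gating mechanism are hypothetical; the paper only
    states that modalities are aligned into a shared space and tokens are
    adaptively reweighted by information density.
    """

    def __init__(self, dim_opt: int, dim_sar: int, dim_shared: int):
        super().__init__()
        self.proj_opt = nn.Linear(dim_opt, dim_shared)  # align optical tokens
        self.proj_sar = nn.Linear(dim_sar, dim_shared)  # align SAR tokens
        self.gate = nn.Linear(dim_shared, 1)            # per-token density score

    def forward(self, opt_tokens: torch.Tensor, sar_tokens: torch.Tensor) -> torch.Tensor:
        # Map heterogeneous modalities into one shared embedding space.
        opt = self.proj_opt(opt_tokens)       # (B, N_opt, D)
        sar = self.proj_sar(sar_tokens)       # (B, N_sar, D)
        fused = torch.cat([opt, sar], dim=1)  # (B, N_opt + N_sar, D)
        # Adaptive reweighting: softmax over per-token density scores,
        # scaled by the token count so the weights average to 1.
        weights = F.softmax(self.gate(fused), dim=1)  # (B, N, 1)
        return fused * weights * fused.size(1)

# Usage: fuse 196 optical and 196 SAR tokens into one sequence for the LLM.
fusion = CrossModalFusion(dim_opt=1024, dim_sar=768, dim_shared=512)
tokens = fusion(torch.randn(2, 196, 1024), torch.randn(2, 196, 768))
print(tokens.shape)  # torch.Size([2, 392, 512])
```

The key design point this sketch illustrates is that reweighting happens per token rather than per modality, so an uninformative sensor (e.g., cloudy optical imagery) can be down-weighted locally without discarding the whole stream.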