OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Published: 6/4/2025
Abstract

Spatial reasoning is a key aspect of cognitive psychology and remains a major bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks represent only the most fundamental level of spatial reasoning. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through Internet data crawling and careful manual annotation, we construct over 1.5K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs, as well as existing reasoning and spatial understanding models, exhibit significant limitations in comprehensive spatial understanding. We further analyze failure cases and propose potential directions for future research.
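
For orientation, the sketch below shows how a multiple-choice benchmark of this shape might be scored per category. The record fields, file format, and `predict` wrapper are illustrative assumptions for a benchmark like OmniSpatial, not the paper's released interface.

```python
import json
from collections import defaultdict

# Hypothetical record layout for one question-answer pair; the actual
# OmniSpatial release may use different field names and file formats:
# {
#   "image": "images/0001.jpg",
#   "question": "If the car keeps moving forward, which object does it pass first?",
#   "choices": ["the bench", "the mailbox", "the tree", "the bicycle"],
#   "answer": 1,                      # index into `choices`
#   "category": "dynamic_reasoning",  # one of the four major categories
#   "subcategory": "motion_prediction"
# }

def evaluate(qa_path, predict):
    """Compute per-category accuracy on a multiple-choice spatial benchmark.

    `predict(image_path, question, choices)` is any caller-supplied function
    that wraps a VLM and returns the index of the chosen answer.
    """
    with open(qa_path) as f:
        pairs = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in pairs:
        pred = predict(item["image"], item["question"], item["choices"])
        total[item["category"]] += 1
        if pred == item["answer"]:
            correct[item["category"]] += 1

    # Report accuracy per category, e.g. {"dynamic_reasoning": 0.42, ...}
    return {cat: correct[cat] / total[cat] for cat in total}
```

Breaking accuracy out by category, as above, is what lets an evaluation separate basic relation understanding from the harder dynamic-reasoning and perspective-taking tasks the abstract highlights.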