a month ago

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Abstract

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
