SpatialTree: The Branching Structure of Spatial Abilities in MLLMs

Yuxi Xiao Longfei Li Shen Yan Xinhang Liu Sida Peng Yunchao Wei Xiaowei Zhou Bingyi Kang

Abstract

Cognitive science suggests that spatial abilities develop progressively, from perception through reasoning to interaction. In multimodal large language models (MLLMs), however, this hierarchy remains poorly understood, with most studies focusing on a narrow set of tasks. We propose SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), mental simulation (L3), and agentic competence (L4). Building on this taxonomy, we construct the first capability-centric hierarchical benchmark, covering 27 sub-abilities. Evaluation reveals a clear structure: L1 abilities are largely orthogonal to one another, while higher-level abilities are strongly correlated, indicating increasing interdependence up the hierarchy. Targeted supervised fine-tuning uncovers surprising transfer properties: negative transfer occurs within L1, yet strong cross-level transfer emerges from low- to high-level abilities, with notable synergy. We further investigate how to improve the entire hierarchy. Naive reinforcement learning (RL) that simply encourages models to "think harder" proves unreliable: it helps complex reasoning but degrades intuitive perception. We therefore propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. Through the construction of SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

One-sentence Summary

Researchers from Zhejiang University, ByteDance Seed, and Beijing Jiaotong University propose SpatialTree, a cognitive-science-inspired four-level hierarchy of spatial abilities in multimodal LLMs. Their benchmark reveals that low-level perception skills are largely orthogonal while higher-level reasoning abilities are strongly correlated, and it uncovers cross-level transfer benefits. They further introduce an auto-think strategy that suppresses unnecessary deliberation during reinforcement learning, consistently improving performance across all levels where naive RL proves unreliable, offering a path toward systematically scaling spatial cognition.

Key Contributions

  • Introduces SpatialTree, a cognitive-science-inspired hierarchical taxonomy organizing spatial abilities in multimodal LLMs into four progressive levels (L1 perception to L4 agentic competence), addressing the lack of structured understanding beyond narrow task-centric evaluations. This framework enables systematic analysis of spatial skill dependencies, revealing through evaluation across 27 sub-abilities that L1 skills are largely orthogonal while higher-level skills exhibit strong correlations.
  • Proposes the first capability-centric benchmark for spatial intelligence, rigorously assessing mainstream MLLMs and uncovering critical transfer dynamics: negative transfer occurs within L1 perception skills during fine-tuning, but strong cross-level transfer emerges from low- to high-level abilities with notable synergy, highlighting pathways for efficient capability scaling.
  • Identifies limitations in reinforcement learning for spatial tasks—where extensive "thinking" improves complex reasoning but degrades intuitive perception—and introduces an auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently enhance performance across all hierarchical levels as validated in the benchmark.

Introduction

Spatial intelligence—the ability to perceive, reason about, and interact with 2D/3D spaces—is critical for multimodal large language models (MLLMs) to handle real-world tasks like navigation or physical reasoning, yet its complexity has hindered systematic study. Prior work fragmented spatial abilities into isolated task-specific evaluations, such as single-image positioning or 3D point cloud analysis, failing to reveal how foundational skills compose hierarchically or transfer across contexts. The authors address this by introducing SpatialTree, the first capability-centric framework that organizes spatial intelligence into a four-layer cognitive hierarchy (L1–L4), enabling structured analysis of ability emergence and targeted scaling strategies for MLLMs through strategic data utilization.

Dataset

The authors introduce SpatialTree-Bench, a capability-centric benchmark for spatial intelligence, structured into four hierarchical levels. Here's a concise overview:

  • Composition and sources:

    • Built by unifying 10+ existing datasets (e.g., CameraBench, MMSI-Bench) covering L1–L3 spatial abilities.
    • Augmented with SpatialPlus, a new dataset targeting gaps (especially L4), generated via the SpatialEngine using 3D reconstruction data, game footage, egocentric videos, and robotics datasets.
  • Key subset details:

    • L1 Perception (size: ~15K samples): Uses expert models (DepthAnything3, SpatialTracker) to extract geometric attributes (distance, size, motion). Filtered via QA templates and LLM rephrasing.
    • L2 Mental Mapping (size: ~8K samples): Processes BEV maps from 3D reconstruction pipelines; augmented with spatial captions and memory-retrieval QAs via multimodal LLMs.
    • L3 Mental Simulation (size: ~12K samples): Enhances reasoning tasks with structured Chain-of-Thought templates; filters prioritize causal/relational problem-solving.
    • L4 Spatial Agent (size: ~5K samples): Curates Internet-sourced navigation/manipulation data (human hands, robots); filters enforce multi-step action sequences via manual annotation.
  • Usage in training/evaluation:

    • Training splits combine SpatialPlus with general visual-instruction data (LLaVA-Video, LLaVA-NeXT) using VST’s mixture ratio (80% general data, 20% spatial data); see the mixture-sampling sketch after this list.
    • Evaluated via hybrid multiple-choice + LLM-as-a-Judge protocols for granular capability assessment.
    • SFT/RL experiments reveal hierarchical dependencies: L1 skills transfer to higher levels, while RL exposes reasoning/perception trade-offs.
  • Processing strategies:

    • L4 actions discretized into high-level motion primitives (e.g., "Dolly In," "Pan Left") via an Action-Extraction Pipeline.
    • Multi-format QA generation (e.g., multiple-choice, abstract descriptions) boosts diversity for single problems.
    • Human annotators convert interaction sequences into executable multi-step tasks; LLMs rephrase QAs for linguistic consistency.
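
To make the training mixture concrete, here is a minimal sketch of batch sampling at the 80/20 general-to-spatial ratio described above. The pool names, record fields, and helper function are hypothetical, not the paper's implementation; only the ratio comes from the source.

```python
import random

# Hypothetical data pools; in practice these would be loaded from
# LLaVA-Video / LLaVA-NeXT (general) and SpatialPlus (spatial).
general_pool = [{"source": "llava_general", "id": i} for i in range(100_000)]
spatial_pool = [{"source": "spatialplus", "id": i} for i in range(20_000)]

def sample_mixed_batch(batch_size: int, spatial_ratio: float = 0.2):
    """Sample a training batch with roughly 80% general and 20% spatial data."""
    n_spatial = round(batch_size * spatial_ratio)
    n_general = batch_size - n_spatial
    batch = random.sample(general_pool, n_general) + random.sample(spatial_pool, n_spatial)
    random.shuffle(batch)
    return batch

batch = sample_mixed_batch(64)
print(sum(x["source"] == "spatialplus" for x in batch), "spatial samples out of", len(batch))
```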

Method

The authors leverage a hierarchical taxonomy, termed SpatialTree, to structure and evaluate spatial intelligence across four progressive levels, from foundational perception to agentic execution. This framework is designed to mirror cognitive development, where lower layers provide the perceptual scaffolding necessary for higher-order reasoning and action. The architecture is not merely descriptive but operationalized through a multi-stage data engine and a weighted aggregation metric that reflects the dependency structure of spatial capabilities.

At the base, Level 1 (Perception) encapsulates primitive visual sensing abilities: Geometry (Distance, Size, Shape), Motion (Egocentric, Allocentric), Orientation (Gravity, Object), Relation (Topology, Correspondence), and Localization (Detection, Grounding). These are instantiated via automated pipelines that leverage expert models and LLMs to generate question-answer pairs from web images, ensuring broad coverage of real-world spatial cues. As shown in the figure below, these perceptual primitives are the building blocks for all subsequent layers.
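
To illustrate how such automated pipelines can turn expert-model outputs into L1 question-answer pairs, here is a minimal sketch for a distance-comparison question built from per-object depth estimates. The object records, template wording, and function name are illustrative assumptions, not the paper's exact code.

```python
# Minimal sketch: convert per-object metric estimates (e.g., depths predicted by
# a monocular depth expert model) into a distance-comparison QA pair.
objects = [
    {"name": "red chair", "depth_m": 2.1},
    {"name": "wooden table", "depth_m": 3.4},
]

def make_distance_qa(obj_a: dict, obj_b: dict) -> dict:
    closer = obj_a if obj_a["depth_m"] < obj_b["depth_m"] else obj_b
    question = (
        f"Which object is closer to the camera: the {obj_a['name']} "
        f"or the {obj_b['name']}?"
    )
    return {"question": question,
            "choices": [obj_a["name"], obj_b["name"]],
            "answer": closer["name"]}

qa = make_distance_qa(*objects)
print(qa["question"], "->", qa["answer"])  # would then be rephrased by an LLM
```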

Level 2 (Mental Mapping) transitions from raw perception to semantic alignment. It comprises Understanding—translating visual scenes into linguistic descriptions, recognizing affordances, and taking alternative perspectives—and Memory, which constructs and retrieves cognitive maps from fragmented observations. The data engine for this level ingests raw images and videos, applies reconstruction and augmentation pipelines, and uses multi-modal captioning to generate prompts that test spatial understanding and memory retrieval. The figure below illustrates how this layer bridges perception with language and memory, enabling the system to reason about object function and spatial history.

Level 3 (Mental Simulation) introduces causal and sequential reasoning over spatial configurations. It includes tasks such as geometry puzzles, route planning, and operation sequencing, which require the agent to simulate transformations and predict outcomes. The data engine for this level uses Chain-of-Thought (CoT) pipelines and LLM rephrasing to generate complex reasoning prompts from annotated datasets, such as block manipulation sequences. This layer demands the integration of memory and understanding to simulate spatial dynamics before execution.
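
As a concrete, assumed example of such a structured Chain-of-Thought scaffold, the sketch below wraps an operation-sequencing problem in a reasoning template before LLM rephrasing; the template text and field names are illustrative only.

```python
# Hypothetical structured CoT template for an operation-sequencing problem.
COT_TEMPLATE = """You are given a spatial manipulation puzzle.
Initial state: {initial_state}
Goal state: {goal_state}
Reason step by step: (1) describe the current configuration,
(2) simulate each candidate move and its effect,
(3) output the shortest valid move sequence."""

def build_cot_prompt(initial_state: str, goal_state: str) -> str:
    return COT_TEMPLATE.format(initial_state=initial_state, goal_state=goal_state)

prompt = build_cot_prompt(
    initial_state="block A on block B; block C on the table",
    goal_state="block C on block A; block A on block B",
)
print(prompt)  # in the pipeline, an LLM would rephrase this prompt for diversity
```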

Level 4 (Agentic Competence) represents the culmination: the ability to translate internal plans into executable actions in 3D environments. The authors formalize the agent’s decision-making as a probabilistic model:

$$(S_t, A_t, M_t) \sim P_\theta\left(\,\cdot\,\middle|\;O_t, H_{t-1}\right), \quad \text{where} \quad H_{t-1} = \{(O_0, A_0, M_0), \ldots, (O_{t-1}, A_{t-1}, M_{t-1})\}$$

where $O_t$ is the multi-modal observation, $S_t$ the latent state, $A_t$ the action, and $M_t$ the updated memory. Actions are mapped to a standardized 6-DoF motion space for navigation (e.g., dolly, truck, pedestal, pan, tilt, roll) and a 7-DoF space for manipulation (including gripper state), enabling cross-domain evaluation. The data engine for L4 curates action annotations from game videos, robotic datasets, and human-hand manipulation clips, using human annotators and action-extraction pipelines to generate prompts that require step-by-step control sequences.
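
A minimal sketch of how these standardized action spaces and rollout steps might be represented, assuming the discrete navigation primitives named above and a 7-DoF manipulation vector with gripper state; the class and field names are illustrative, not the paper's code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

class NavPrimitive(Enum):
    """Discrete 6-DoF navigation primitives (three translation, three rotation axes)."""
    DOLLY_IN = "Dolly In"            # translate along the viewing axis
    DOLLY_OUT = "Dolly Out"
    TRUCK_LEFT = "Truck Left"        # lateral translation
    TRUCK_RIGHT = "Truck Right"
    PEDESTAL_UP = "Pedestal Up"      # vertical translation
    PEDESTAL_DOWN = "Pedestal Down"
    PAN_LEFT = "Pan Left"            # yaw rotation
    PAN_RIGHT = "Pan Right"
    TILT_UP = "Tilt Up"              # pitch rotation
    TILT_DOWN = "Tilt Down"
    ROLL_CW = "Roll CW"              # roll rotation
    ROLL_CCW = "Roll CCW"

@dataclass
class ManipAction:
    """7-DoF manipulation action: end-effector pose delta plus gripper state."""
    translation: Tuple[float, float, float]  # (dx, dy, dz)
    rotation: Tuple[float, float, float]     # (droll, dpitch, dyaw)
    gripper_open: bool

@dataclass
class AgentStep:
    """One rollout step of (S_t, A_t, M_t) ~ P_theta( . | O_t, H_{t-1})."""
    observation: str                                  # O_t: current multi-modal observation
    state: str                                        # S_t: latent scene summary
    action: NavPrimitive                              # A_t: navigation-only here for brevity
    memory: List[str] = field(default_factory=list)   # M_t: running history summary
```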

To evaluate performance across this hierarchy, the authors implement a bottom-up weighted aggregation scheme. As shown in the figure below, each node in the SpatialTree is assigned a weight reflecting its foundational importance, with L1 receiving the highest weight (0.25) due to its role as a prerequisite for higher layers. The final score is computed recursively: the score of a parent node is the weighted sum of its children’s scores. This ensures the metric is both theoretically grounded in cognitive hierarchy and empirically validated through correlation analysis of model performance.
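
The recursive scoring can be sketched as follows, assuming each node carries a weight under its parent and leaf nodes carry raw sub-ability scores. The tree shape, leaf scores, and per-level weights here are illustrative stand-ins guided by the description above, not the paper's actual values.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A node in the SpatialTree scoring hierarchy."""
    name: str
    weight: float                      # relative weight under its parent
    score: Optional[float] = None      # raw score, set only on leaf sub-abilities
    children: List["Node"] = field(default_factory=list)

def aggregate(node: Node) -> float:
    """Bottom-up weighted aggregation: a parent's score is the weighted sum
    of its children's scores (child weights assumed to sum to 1 per parent)."""
    if not node.children:
        return node.score if node.score is not None else 0.0
    return sum(child.weight * aggregate(child) for child in node.children)

# Illustrative weights and leaf scores only.
tree = Node("SpatialTree", 1.0, children=[
    Node("L1 Perception", 0.25, children=[
        Node("Distance", 0.5, score=62.0),
        Node("Size", 0.5, score=55.0),
    ]),
    Node("L2 Mental Mapping", 0.25, score=48.0),
    Node("L3 Mental Simulation", 0.25, score=41.0),
    Node("L4 Agentic Competence", 0.25, score=30.0),
])
print(round(aggregate(tree), 1))  # overall benchmark score
```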

The training and evaluation process is thus not monolithic but tiered: models are assessed on increasingly complex tasks, with performance at each level informing the next. The data engines ensure scalability and diversity, while the action mapping and prompt templates standardize output formats for fair comparison. This architecture enables a granular, interpretable assessment of spatial intelligence, moving beyond single-task benchmarks to a unified, multi-layered evaluation framework.

Experiment

  • Evaluated MLLMs on SpatialTree-Bench across 27 sub-abilities: Gemini 2.5 Pro achieved 50.1 overall, with Qwen3-VL (open-source) scoring 40.0; revealed L1 skills as orthogonal while higher levels (L3-L4) showed strong correlations indicating interdependency.
  • Supervised fine-tuning on low-level abilities (e.g., distance, size) demonstrated negative intra-L1 transfer but significant cross-level gains, improving robotic manipulation by 27.1% and complex reasoning by 36.0%; blended training of multiple abilities yielded +1.1 overall synergy.
  • Hierarchy-aware reinforcement learning with auto-think strategy suppressed excessive deliberation for intuitive tasks, enabling consistent performance improvements across all spatial levels compared to unreliable naive RL approaches.

The authors evaluate Qwen2.5-VL-7B under various reinforcement learning strategies, finding that full RL with auto-think yields the highest average score (30.8) and improves most sub-abilities, particularly in L4 Agentic Competence. In contrast, naive RL without auto-think harms performance in L2 Mental Mapping and L3 Mental Simulation, while targeted RL per level shows mixed gains, indicating that hierarchical reward design is critical for balanced spatial ability development.
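
The summary does not spell out the auto-think mechanism, but the core idea, skipping explicit deliberation on intuitive (e.g., L1) queries while reserving a thinking phase for higher-level reasoning, can be sketched as below. The prompt tags, the keyword- and length-based gating heuristic, and the function names are assumptions for illustration only.

```python
# Hypothetical sketch of an "auto-think" gate: trigger a reasoning phase only
# when the query appears to need deliberation, otherwise answer directly.
THINK_PREFIX = "Think step by step inside <think>...</think>, then answer."
NO_THINK_PREFIX = "Answer directly and concisely without extended reasoning."

INTUITIVE_HINTS = ("how far", "which is closer", "what size", "which direction")

def needs_deliberation(question: str) -> bool:
    """Crude stand-in for the model's own gating decision: treat short,
    perception-style questions as intuitive and everything else as deliberative."""
    q = question.lower()
    if any(hint in q for hint in INTUITIVE_HINTS):
        return False
    return len(q.split()) > 25  # longer, multi-step prompts trigger thinking

def build_prompt(question: str) -> str:
    prefix = THINK_PREFIX if needs_deliberation(question) else NO_THINK_PREFIX
    return f"{prefix}\n\nQuestion: {question}"

print(build_prompt("Which is closer to the camera, the chair or the lamp?"))
```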

The authors evaluate fine-tuned multimodal LLMs across a hierarchical spatial benchmark, finding that combining multiple low-level perception abilities (distance, size, correspondence) yields synergistic gains in higher-level tasks, while individual fine-tuning often harms performance within the same level. Results show that blended training improves overall scores and mitigates negative transfer, particularly benefiting mental mapping and agentic competence.

The authors evaluate multimodal LLMs across a hierarchical spatial benchmark, finding that Gemini 2.5 Pro achieves the highest overall score (50.1), while Qwen3VL-235B leads among open-source models (40.0). Results show strong performance in higher-level tasks like goal execution and open exploration for thinking models, whereas non-thinking models such as GPT-4o and Gemini 2.5 Pro NT lag behind in agentic competence despite competitive perception scores. Open-source models generally underperform, with Qwen3VL-235B showing the best balance across levels but still trailing proprietary thinking models in simulation and agentic tasks.

The authors use Pearson correlation analysis to reveal that higher-level spatial abilities (L3–L4) are strongly interdependent, while lower-level perception skills (L1) show weak correlations, indicating they operate largely independently. Results show that fine-tuning on specific low-level abilities can yield negative transfer within the same level but enable positive cross-level transfer to higher-order reasoning and execution tasks. The observed structure supports a hierarchical model of spatial intelligence, where foundational perception skills underpin increasingly complex, interdependent reasoning capabilities.

