HyperAIHyperAI

Command Palette

Search for a command to run...

空間理論:Foundation Modelsは能動的な探索を通じて空間的信念を構築できるか?

概要

空間的な具現化知能(Spatial Embodied Intelligence)は、多くの場合、部分観測性(partial observability)の下で動作します。そこでは、エージェントは完全な観測情報を受動的に消費するのではなく、欠落している情報を獲得するために自ら行動を起こさなければなりません。このような環境においては、不確実性を低減し、空間的な理解の構築を支援するような「情報の価値が高いアクション(informative actions)」を能動的に選択できるかどうかが、進展の鍵となります。マルチモーダル基盤モデル(multimodal foundation models)は、受動的なマルチモーダル知覚および推論タスクにおいて高い性能を示していますが、部分観測性下における能動的かつ自律的な探索(self-directed exploration)を支援する能力については、これまで体系的な研究が行われてきませんでした。特に、これらのモデルが、時間の経過とともに一貫した空間的な信念(spatial belief)を構築・維持するために、「次に何を観測すべきか」を判断できるのか、またそれはどのように実現されるのかについては、依然として不明な点が多く残されています。そこで本研究では、THEORY OF SPACEを提案します。これは、自律的な能動的探索を通じて能動的に情報を獲得し、一連の部分的な観測から空間的な信念を構築、修正、および活用するエージェントの能力と定義されます。我々は、テキストおよび視覚的な環境を用いたbenchmarkを用いて、このTHEORY OF SPACEを実装しました。本研究の目的は、特定のタスクを解決することではなく、好奇心駆動型の探索(curiosity-driven exploration)を通じて、完全かつ正確な空間的信念を構築することにあります。

One-sentence Summary

The authors propose THEORY OF SPACE, defining an agent's ability to construct, revise, and exploit a spatial belief through self-directed active exploration under partial observability, and implement this via a benchmark with textual and visual environments where foundation models engage in curiosity-driven exploration to build a complete, accurate spatial belief from sequential, partial observations rather than solving specific tasks.

Key Contributions

  • This work defines THEORY OF SPACE as the capacity of foundation models to actively acquire information and construct a coherent spatial belief through self-directed exploration under partial observability. The framework shifts spatial evaluation from answering questions at fixed views to building and maintaining revisable world models over time.
  • A new multimodal benchmark implements this concept using parallel text- and vision-based worlds that allow for controlled diagnosis of failures across symbolic versus perceptual observation streams. The system requires agents to externalize evolving cognitive maps and uncertainty, making spatial belief measurable rather than implicit during task-agnostic exploration.
  • Empirical results demonstrate that active exploration creates a significant bottleneck where perception errors and belief instability lead to global map corruption. Analysis of belief probes shows that models exhibit strong inertia when revising obsolete priors, particularly in vision-based updates regarding orientation and facing.

Introduction

Spatial embodied intelligence operates under partial observability, necessitating active action selection to construct spatial understanding. While multimodal foundation models perform well on passive perception tasks, existing benchmarks rarely assess their ability to support self-directed exploration or maintain coherent spatial beliefs over time. Prior work often conflates exploration efficiency with specific task goals or treats internal cognitive states as opaque. To address this, the authors introduce THEORY OF SPACE, a framework that evaluates an agent's capacity to actively acquire information and revise internal spatial beliefs without relying on specific downstream tasks. They implement a benchmark featuring text and vision environments and develop spatial belief probing to externalize and measure the quality of the agent's cognitive map. This methodology exposes critical limitations in current models, including performance degradation during active exploration and an inability to overwrite obsolete spatial priors.

Dataset

  • Dataset Composition and Sources: The authors utilize procedurally generated multi-room indoor layouts on an N by M grid rather than static real-world data. Visual assets are sourced from the Objaverse library and rendered using the ThreeDWorld simulator.
  • Key Details for Subsets: The environment supports parallel Text and Visual Worlds. The Visual World provides ego-centric RGB images at 384 by 384 resolution using a library of 293 distinct 3D models. To ensure diversity, each object type appears at most once within a single scene. The Text World offers symbolic observations with discretized bins for direction and distance.
  • Usage in the Study: The benchmarking process divides interaction into an Exploration Phase for belief construction and a Reasoning Phase for spatial tasks. Agents interact via a Gym-style interface using high-level actions like Observe and Rotate. Evaluation tasks employ open-ended questions to measure Route and Survey knowledge while minimizing knowledge leakage.
  • Processing and Metadata Construction: Spatial relationships are discretized into eight 45-degree bins for allocentric direction and five labels for egocentric views within a 90-degree field of view. Distance is categorized into six bins ranging from same to very far. The visual setting includes reference images to calibrate perception of unit distance and angular cones.

Method

The authors formalize the Theory of Space as the capacity to manipulate a probabilistic belief BtB_tBt through three core operations: Construct, Revise, and Exploit. The overall framework involves an agent navigating a partially observable environment to perform active exploration and update its internal spatial belief, as illustrated in the framework diagram.

The agent operates within a discretized observation space to facilitate reasoning. Visual and textual observations are mapped to specific distance bins (near, mid, far) and angular sectors (e.g., front-left, front-right), providing a structured input for the model, as shown in the figure below.

To diagnose how foundation models manage these beliefs, the method employs an explicit probing mechanism. The agent processes its exploration history to generate a structured cognitive map and identify unexplored regions, effectively externalizing its internal spatial representation, as depicted in the figure below.

The assessment of belief exploitation is categorized into two primary tasks: Belief on Route and Belief on Survey. The former evaluates egocentric, path-based reasoning and landmark relations, while the latter assesses allocentric, map-like understanding and global spatial inference, as detailed in the figure below.

Finally, the agent is guided by a comprehensive set of prompts that define the exploration goals, action constraints, and formatting rules. These prompts ensure the agent adheres to the spatial reasoning tasks and provides structured outputs, as shown in the figure below.

Experiment

The evaluation framework assesses spatial cognition through active exploration and passive comprehension settings across both text and vision modalities, utilizing standardized proxy agents to isolate reasoning capabilities from exploration efficiency. Results indicate a significant modality gap where text-based performance consistently exceeds vision-based reasoning, while active exploration strategies generally underperform passive comprehension due to incomplete information coverage and higher action costs. Diagnostic probing of cognitive maps highlights that visual agents suffer from unstable belief updates and difficulty overwriting obsolete priors during environmental shifts.

The provided data compares the active exploration exploitation performance of GPT-5.2 and GEMINI-3 PRO in text and vision environments. GEMINI-3 PRO achieves higher performance than GPT-5.2 across both modalities. Additionally, the results show that performance in the vision setting is higher than in the text setting for both models. GEMINI-3 PRO outperforms GPT-5.2 in both text and vision tasks. Vision-based performance metrics exceed those of the text-based setting. The performance gap between the two models is larger in the vision modality.

The evaluation highlights a substantial performance disparity between text-based and vision-based environments, with accuracy metrics significantly higher in text settings. In vision-based tasks, GEMINI-3 PRO consistently outperforms GPT-5.2 across correctness and perception categories, while GPT-5.2 demonstrates higher stability in text-based scenarios. Both models face significant challenges with orientation estimation in visual environments, where scores are notably lower than positional accuracy. Text-based environments yield substantially higher correctness and perception scores compared to vision-based environments. GEMINI-3 PRO achieves superior overall correctness and perception in vision-based tasks compared to GPT-5.2. Orientation accuracy is significantly lower than positional accuracy in vision-based settings for both models.

The data compares proprietary models on spatial reasoning tasks across vision-based and text-based environments, highlighting a significant modality gap where text performance is superior. GEMINI-3 PRO achieves higher average scores in the vision-based setting, while GPT-5.2 demonstrates stronger performance in the text-based setting. Text-based reasoning tasks yield significantly higher accuracy scores than vision-based tasks for both models. GEMINI-3 PRO outperforms GPT-5.2 in the vision-based world across the majority of spatial reasoning metrics. GPT-5.2 achieves a higher overall average than GEMINI-3 PRO in the text-based world environment.

The authors evaluate proprietary models on spatial reasoning tasks divided into Route and Survey categories across vision and text environments. Results show a significant modality gap where text-based performance substantially exceeds vision-based performance for all tasks. GPT-5.2 demonstrates the highest overall average scores in both modalities within this specific evaluation setup. Text-based environments yield significantly higher accuracy across all spatial reasoning tasks compared to vision-based settings. GPT-5.2 achieves higher average performance than GEMINI-3 PRO in both text and vision modalities in this evaluation. Perception and mental rotation tasks exhibit a sharp decline in effectiveness when transitioning from text to visual inputs.

The authors evaluate spatial reasoning capabilities in multi-room environments, comparing 2-room and 4-room configurations across text and vision modalities. Results show that increasing environmental complexity leads to a decline in overall performance and significantly widens the gap between passive comprehension and active exploration success. GEMINI-3 PRO demonstrates greater robustness in active tasks within complex layouts compared to GPT-5.2, although both models perform substantially better in text-based settings than vision-based ones. Performance metrics decline and the discrepancy between passive and active results grows as the number of rooms increases. GEMINI-3 PRO maintains higher active exploration accuracy relative to passive performance in 4-room settings compared to GPT-5.2. Vision-based environments consistently result in lower accuracy scores compared to text-based environments for both models.

The evaluation compares GPT-5.2 and GEMINI-3 PRO on spatial reasoning and active exploration tasks across text and vision modalities with increasing environmental complexity. A consistent finding across all setups is that text-based performance substantially exceeds vision-based accuracy for both models. Performance outcomes vary by evaluation context, with GEMINI-3 PRO leading in visual robustness and GPT-5.2 excelling in specific text-based or overall configurations.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています