il y a 10 heures

Yifei Li Pengyiang Liu Yuhang Zang Zhongyue Shi Qi Fu Hongye Hao Jiwen Lu

Table des matières

Résumé

Les agents multimodaux en robotique, en réalité augmentée (AR) et dans la conduite autonome doivent raisonner sur les lieux et les agencements à partir de flux egocentriques continus, en s'appuyant souvent sur des éléments de preuve situés en dehors du champ de vision actuel. Les benchmarks existants évaluent soit de manière hors ligne sur l'intégralité des vidéos, soit se concentrent sur des événements plutôt que sur la structure spatiale. Nous présentons OVO-S-Bench, un benchmark entièrement annoté par des humains dédié à l'intelligence spatiale en streaming, comprenant 1 680 questions réparties sur 348 vidéos sources. Le processus d'annotation mobilise 12 annotateurs formés, chacun jouant également le rôle de relecteur croisé en aveugle, pour un total d'environ 804 heures-personne consacrées à l'assurance qualité sur plusieurs tours. Chaque question est associée à un horodatage de requête et à un intervalle de preuve, et lors de l'évaluation, le modèle n'a accès qu'au préfixe précédant la requête. Les questions couvrent quatre niveaux d'abstraction croissante : la perception egocentrique instantanée, le suivi du contexte spatiotemporel, la simulation et le raisonnement spatiaux, ainsi que la cartographie allocentrique. Parmi les 38 MLLMs propriétaires et open-source évalués, Gemini-3.1-Pro accuse un retard de 27 points par rapport aux experts humains (59,2 contre 86,6), la cartographie allocentrique constituant le goulot d'étranglement principal. Il est à noter que les MLLMs adaptés au streaming et affinés spatialement sous-performent leurs propres backbones. Nous constatons par ailleurs que le raisonnement de type chain-of-thought amplifie les erreurs spatiales lorsqu'il n'est pas ancré dans le flux. En mettant en lumière ces limites, OVO-S-Bench établit un environnement de test exigeant pour les MLLMs spatiaux en streaming de nouvelle génération.

One-sentence Summary

OVO-S-Bench is a fully human-annotated hierarchical benchmark that evaluates multimodal LLMs on streaming spatial intelligence by restricting model input to the video prefix preceding a query timestamp across four levels of increasing abstraction, thereby addressing the limitations of offline, event-focused datasets for robotics, AR, and autonomous driving applications.

Key Contributions

OVO-S-Bench is introduced as a fully human-annotated benchmark for streaming spatial intelligence comprising 1,680 questions across 348 videos. Each question includes a query timestamp and evidence interval, enforcing a strict prefix-only viewing constraint during evaluation while spanning four abstraction levels from instantaneous egocentric perception to allocentric mapping.
Evaluation across 38 proprietary and open-source multimodal large language models reveals that Gemini-3.1-Pro trails human experts by 27 points, with allocentric mapping serving as the primary bottleneck. The results further demonstrate that streaming and spatial fine-tuning can degrade performance relative to base backbones.
Analysis of reasoning mechanisms indicates that chain-of-thought prompting amplifies spatial errors when models lack grounding in the continuous video stream. These findings establish a rigorous testbed that directs future research toward more robust streaming spatial multimodal large language models.

Introduction

Object-level change detection requires systems to compare spatial states across different observation times and identify additions, removals, movements, or alterations. This capability is essential for applications like autonomous navigation, video surveillance, and environmental monitoring, where accurately tracking state transitions over time directly informs downstream decision-making. Prior models typically struggle with this task because they lack reliable mechanisms to retain and align prior visual states with current observations, often defaulting to simple present-frame enumeration rather than performing true temporal reasoning. To address this limitation, the authors formalize object-level change detection as a core spatiotemporal consistency verification task. They introduce a structured evaluation protocol that explicitly distinguishes cross-time comparison from static scene understanding, enabling precise measurement of how well models remember and contrast visual information across temporal sequences.

Dataset

Dataset Composition and Sources

The authors introduce OVO-S-Bench, a fully human-annotated benchmark containing 1,680 questions distributed across 348 source videos.
The videos are sourced from nine publicly accessible datasets spanning five distinct regimes: indoor walkthroughs, egocentric daily activities, outdoor and world scenes, autonomous driving footage, and spatially annotated 3D environments.
Annotation was conducted by twelve trained volunteers with backgrounds in 3D computer vision, accumulating approximately 804 person-hours of work.

Subset Details and Taxonomy

The dataset is organized into four hierarchical levels of spatial abstraction, each containing specific task families and totaling thirty canonical task types.
Level one focuses on instantaneous egocentric perception, requiring answers from frames immediately surrounding the query point.
Level two tests spatiotemporal context tracking, where evidence appeared earlier in the video but is no longer visible at query time.
Level three demands spatial simulation and reasoning, including mental rotation, hypothetical state changes, and route planning.
Level four requires allocentric spatial mapping, forcing the model to integrate multiple viewpoints into a global topological representation or match trajectories to bird's-eye maps.
The mean video prefix length at query time is 8.8 minutes, with evidence intervals scaling significantly from level one to level four.

Data Usage and Evaluation Protocol

The authors use this dataset strictly for evaluation rather than model training, establishing a strict streaming protocol where models only receive the video prefix preceding the designated query timestamp.
The benchmark evaluates thirty-eight multimodal and video foundation models, including proprietary systems, general backbones, and architectures specifically fine-tuned for streaming or spatial tasks.
Each question is designed to be uniquely answerable from the visual prefix, with distractors carefully crafted to remain visually plausible while preventing resolution through language patterns or world knowledge.

Processing, Metadata, and Quality Control

The authors construct a strict temporal split using query timestamps and evidence intervals, ensuring all supporting visual cues appear strictly before the model is evaluated.
A multi-stage filtering pipeline removes shortcut vulnerabilities by running a text-only language model probe, followed by blind cross-review from a second annotator and adjudication by senior researchers.
Specialized construction techniques are applied to higher levels, including generative image editing for consistency checks, standardized named-entity labeling for unviewed rooms, and automated bird's-eye view rendering for map alignment tasks.
The final dataset is released as a single JSONL file containing structured metadata for each item, including video paths, temporal markers, task categories, and aligned answer keys, with frame extraction reserved solely for offline analysis and error categorization.

Method

The authors present a comprehensive framework for evaluating spatial reasoning in video understanding models, structured around a four-level taxonomy of spatial abstraction: L1 (Instantaneous Egocentric Perception), L2 (Spatial-Temporal Context Tracking), L3 (Generative Spatial Reasoning), and L4 (Global Topological Mapping). Each level represents a distinct cognitive and computational challenge, with L1 focusing on immediate visual perception from a first-person perspective, L2 requiring the retention of spatiotemporal facts after visual support is lost, L3 involving mental simulation of spatial changes or hypothetical scenarios, and L4 demanding the construction of an allocentric (viewer-independent) map of the environment. The framework is designed to enforce a streaming protocol, where models are only allowed to access video content up to the query timestamp, simulating real-time, causal reasoning from an egocentric video stream.

The overall architecture of the benchmark is visualized in the framework diagram, which illustrates the progression from L1 to L4. At the L1 level, the task is to understand the current view, such as identifying what supports a visible object. The L2 level introduces the challenge of tracking spatial context over time, for instance, determining if a location remains the same after a period of absence. The L3 level requires simulating spatial changes, like predicting whether an object can fit through a doorway after a rotation. Finally, the L4 level demands global topological mapping, such as inferring the directional relationship between rooms in an allocentric layout. This hierarchy ensures a systematic evaluation of spatial reasoning capabilities, from immediate perception to complex, integrated world modeling.

The model's interaction with the video stream is governed by a frame-sampling policy that determines which frames are consumed. The policy must adhere to a prefix-only constraint, meaning only frames preceding the query time are accessible. The authors consider several sampling strategies, including a naive baseline that uses only the query frame, a nearest-16f@4fps policy that samples frames from a causal sliding window, and a uniform sampling policy that distributes frames evenly across the prefix. An oracle policy, which is not available at deployment, allocates the frame budget exclusively within annotated evidence intervals to establish an upper bound on performance. Additionally, a log-decay-128 policy implements a recency-weighted schedule, allocating more frames to the most recent time intervals, thereby modeling a coarse exponential prior of near-importance without ground-truth evidence. This sampling mechanism ensures that the model must reason from a constrained, streaming input, mimicking real-world scenarios.

The evaluation framework is designed to address three structural gaps in existing benchmarks: the lack of a streaming protocol in spatial benchmarks, the limited focus on spatial structure in streaming benchmarks, and the absence of the L4 allocentric-mapping level in prior video benchmarks. By enforcing a streaming protocol at the item granularity, the benchmark ensures that evidence is ephemeral and must be reasoned about in real-time. The source videos cover a diverse range of domains, including indoor walkthroughs, outdoor footage, and 3D-rendered environments, providing a comprehensive test bed. The tasks are stratified into the four levels of spatial abstraction, with the previously untested L4 level annotated using named-entity, bird's-eye, and topological supervision, enabling the first empirical test of whether spatial reasoning gaps observed under offline access persist or worsen under causal streaming conditions.

Experiment

The evaluation tests 38 multimodal systems under a strict streaming protocol that restricts visual input to the prefix preceding the query, benchmarking them against human experts and text-only controls to isolate genuine spatial reasoning capabilities. Benchmarking validates a substantial performance gap driven by an allocentric mapping bottleneck that resists scaling and specialized training, while chain-of-thought analysis confirms that explicit reasoning aids cross-frame integration but frequently introduces grounding errors. Further experiments demonstrate that advanced frame-sampling and memory-compression strategies fail to consistently improve performance, indicating that the deficit stems from fundamental reasoning and persistent state limitations rather than contextual retrieval or architectural specialization.

The the the table compares the performance of various models on a video understanding task, focusing on overall accuracy and per-level accuracy across four difficulty levels. Specialized methods, including spatially fine-tuned, streaming, and memory-compression models, generally underperform their base models, with the most significant drops observed on higher-level tasks requiring allocentric mapping. The results highlight a consistent performance gap between models and human experts, particularly on the most complex level, and show that even advanced techniques like chain-of-thought reasoning offer limited gains on certain tasks. Specialized methods consistently underperform their base models, with the largest drops on the highest-level task requiring allocentric mapping. Chain-of-thought reasoning provides modest gains on mid-level tasks but offers little benefit for current-view perception. The performance gap between models and human experts remains substantial, especially on the most complex level, indicating a fundamental challenge in spatial reasoning from streaming video.

The authors evaluate a range of models on a video understanding task using a standardized streaming protocol, comparing performance across different models and configurations. Results show that while some models achieve high accuracy on certain metrics, there is a significant gap between human performance and the best-performing models, with a notable performance drop at the allocentric mapping level. The evaluation highlights that specialized methods and thinking modes do not consistently improve performance, and that model performance is influenced by the complexity of the spatial reasoning required. Performance varies significantly across models, with the best models achieving high accuracy on certain metrics but still falling short of human performance. A substantial performance drop occurs at the allocentric mapping level, indicating a bottleneck in spatial reasoning. Specialized methods and thinking modes do not consistently outperform their base models, suggesting that current approaches are not fully effective in addressing the core challenges of the task.

The authors evaluate a range of models on a spatial reasoning benchmark, focusing on their performance across different levels of spatial understanding. The results show that models struggle with allocentric mapping tasks, which require integrating information from a broader visual context, and that specialized methods do not consistently outperform their base backbones. Additionally, the gap between model performance and human performance remains significant, particularly on complex tasks requiring long-range spatial reasoning. Models show a significant performance gap on allocentric mapping tasks compared to simpler egocentric perception tasks. Specialized methods for streaming or memory compression do not consistently improve overall accuracy over their base backbones. The performance of models remains substantially below human levels, especially on tasks requiring long-range spatial reasoning.

The the the table presents Spearman correlation coefficients between query-level mean accuracy and evidence-span duration, as well as prefix duration, under different control conditions. The correlations are small and not statistically significant, indicating that longer evidence spans or prefixes do not consistently predict lower accuracy. The results suggest that the difficulty of higher-level tasks, particularly allocentric mapping, is not driven by the length of the input but by the complexity of the spatial reasoning required. Correlations between accuracy and evidence-span or prefix duration are weak and not statistically significant. The allocentric mapping bottleneck is not explained by input length but by the complexity of spatial reasoning. The small positive correlation between evidence span and accuracy contradicts the hypothesis that longer inputs are harder to process.

The authors evaluate a range of multimodal models on a video understanding task using a streaming protocol that limits access to frames before a query timestamp. Results show a significant performance gap between human experts and the best models, with the strongest systems still falling short on allocentric mapping tasks. The evaluation highlights that specialized methods and thinking modes do not consistently improve accuracy, and the bottleneck in spatial reasoning appears to stem from the need to abstract long-range spatial relationships rather than from limited memory or frame access. Human performance significantly exceeds the best model accuracy, especially on allocentric mapping tasks. Specialized models and thinking modes fail to outperform their base backbones, indicating a lack of improvement from current design strategies. The primary bottleneck in spatial reasoning is not frame retention or sampling but the ability to abstract long-range spatial relationships across the video prefix.

The experiments evaluate multimodal models on a video understanding benchmark using a streaming protocol that restricts frame access to a query timestamp, systematically assessing performance across escalating spatial reasoning difficulties. The results validate that specialized architectural modifications and chain-of-thought reasoning consistently fail to outperform base models, particularly on allocentric mapping tasks that demand long-range visual integration. Furthermore, the analysis confirms that task difficulty stems from the inherent complexity of abstracting spatial relationships rather than limitations in memory capacity or input length. Ultimately, these findings highlight a fundamental bottleneck in current architectures, as models remain substantially behind human experts in handling complex spatial reasoning from streaming video.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 10 heures

Réponse À Des Questions Visuelles

Yifei Li Pengyiang Liu Yuhang Zang Zhongyue Shi Qi Fu Hongye Hao Jiwen Lu

Table des matières

Résumé

One-sentence Summary

Key Contributions

OVO-S-Bench is introduced as a fully human-annotated benchmark for streaming spatial intelligence comprising 1,680 questions across 348 videos. Each question includes a query timestamp and evidence interval, enforcing a strict prefix-only viewing constraint during evaluation while spanning four abstraction levels from instantaneous egocentric perception to allocentric mapping.
Evaluation across 38 proprietary and open-source multimodal large language models reveals that Gemini-3.1-Pro trails human experts by 27 points, with allocentric mapping serving as the primary bottleneck. The results further demonstrate that streaming and spatial fine-tuning can degrade performance relative to base backbones.
Analysis of reasoning mechanisms indicates that chain-of-thought prompting amplifies spatial errors when models lack grounding in the continuous video stream. These findings establish a rigorous testbed that directs future research toward more robust streaming spatial multimodal large language models.

Introduction

Dataset

Dataset Composition and Sources

The authors introduce OVO-S-Bench, a fully human-annotated benchmark containing 1,680 questions distributed across 348 source videos.
The videos are sourced from nine publicly accessible datasets spanning five distinct regimes: indoor walkthroughs, egocentric daily activities, outdoor and world scenes, autonomous driving footage, and spatially annotated 3D environments.
Annotation was conducted by twelve trained volunteers with backgrounds in 3D computer vision, accumulating approximately 804 person-hours of work.

Subset Details and Taxonomy

The dataset is organized into four hierarchical levels of spatial abstraction, each containing specific task families and totaling thirty canonical task types.
Level one focuses on instantaneous egocentric perception, requiring answers from frames immediately surrounding the query point.
Level two tests spatiotemporal context tracking, where evidence appeared earlier in the video but is no longer visible at query time.
Level three demands spatial simulation and reasoning, including mental rotation, hypothetical state changes, and route planning.
Level four requires allocentric spatial mapping, forcing the model to integrate multiple viewpoints into a global topological representation or match trajectories to bird's-eye maps.
The mean video prefix length at query time is 8.8 minutes, with evidence intervals scaling significantly from level one to level four.

Data Usage and Evaluation Protocol

The authors use this dataset strictly for evaluation rather than model training, establishing a strict streaming protocol where models only receive the video prefix preceding the designated query timestamp.
The benchmark evaluates thirty-eight multimodal and video foundation models, including proprietary systems, general backbones, and architectures specifically fine-tuned for streaming or spatial tasks.
Each question is designed to be uniquely answerable from the visual prefix, with distractors carefully crafted to remain visually plausible while preventing resolution through language patterns or world knowledge.

Processing, Metadata, and Quality Control

The authors construct a strict temporal split using query timestamps and evidence intervals, ensuring all supporting visual cues appear strictly before the model is evaluated.
A multi-stage filtering pipeline removes shortcut vulnerabilities by running a text-only language model probe, followed by blind cross-review from a second annotator and adjudication by senior researchers.
Specialized construction techniques are applied to higher levels, including generative image editing for consistency checks, standardized named-entity labeling for unviewed rooms, and automated bird's-eye view rendering for map alignment tasks.
The final dataset is released as a single JSONL file containing structured metadata for each item, including video paths, temporal markers, task categories, and aligned answer keys, with frame extraction reserved solely for offline analysis and error categorization.

Method

Experiment

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

OVO-S-Bench : Un benchmark hiérarchique pour l'intelligence spatiale en streaming dans les LLM multimodaux

Yifei Li Pengyiang Liu Yuhang Zang Zhongyue Shi Qi Fu Hongye Hao Jiwen Lu

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

OVO-S-Bench : Un benchmark hiérarchique pour l'intelligence spatiale en streaming dans les LLM multimodaux

Yifei Li Pengyiang Liu Yuhang Zang Zhongyue Shi Qi Fu Hongye Hao Jiwen Lu

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

OVO-S-Bench : Un benchmark hiérarchique pour l'intelligence spatiale en streaming dans les LLM multimodaux

Yifei Li Pengyiang Liu Yuhang Zang Zhongyue Shi Qi Fu Hongye Hao Jiwen Lu

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters