
Abstract

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that complex clue-based spatial reasoning demands and long-term memory retrieval requirements each significantly reduce model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
