
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
Abstract

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
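The abstract describes each task as a multi-step trajectory that pairs first-person visual observations with a high-level instruction, grounded actions, and per-step natural language rationales. A minimal sketch of how such a record might be organized in code is shown below; the class and field names (`TrajectoryStep`, `observation`, `action`, `rationale`) are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrajectoryStep:
    """One step of a language-guided task (hypothetical schema)."""
    observation: str  # e.g. path to the first-person RGB frame for this step
    action: str       # grounded action taken by the agent
    rationale: str    # natural-language intent behind the action

@dataclass
class Task:
    """A full multi-step trajectory under one high-level instruction."""
    instruction: str
    steps: List[TrajectoryStep] = field(default_factory=list)

# Usage: a two-step navigation task, with made-up content.
task = Task(
    instruction="Go to the kitchen and pick up the red cup.",
    steps=[
        TrajectoryStep("frame_000.png", "turn_left",
                       "The kitchen doorway is to my left."),
        TrajectoryStep("frame_001.png", "move_forward",
                       "Moving through the doorway toward the counter."),
    ],
)
print(len(task.steps))        # number of steps in the trajectory
print(task.steps[0].action)   # grounded action at the first step
```

Keeping the rationale alongside each action is what lets such a dataset supervise not just *what* the agent did but *why*, which is the signal the paper's supervised fine-tuning stage relies on.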