
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Zhou, Enshen; An, Jingkun; Chi, Cheng; Han, Yi; Rong, Shanyu; Zhang, Chi; Wang, Pengwei; Wang, Zhongyuan; Huang, Tiejun; Sheng, Lu; Zhang, Shanghang
Release Date: 6/8/2025
Abstract

Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still struggle to accurately understand complex 3D scenes and to dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
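
To make the idea of a "metric-sensitive process reward" for spatial referring more concrete, below is a minimal Python sketch. It assumes the RFT reward scores how metrically close a predicted 2D reference point is to an annotated target at each reasoning step; all function names, the Gaussian shaping, and the thresholds are illustrative assumptions, not the paper's actual reward design.

```python
import math

def point_distance_reward(pred_xy, gt_xy, image_diag, sigma=0.05):
    """Illustrative metric-sensitive reward: decays smoothly with the
    normalized distance between a predicted 2D point and the ground-truth
    target point. Constants are assumptions, not taken from the paper."""
    dist = math.dist(pred_xy, gt_xy) / image_diag      # normalize by image diagonal
    return math.exp(-(dist / sigma) ** 2)              # ~1.0 at the target, ~0 far away

def process_reward(step_points, gt_points, image_diag):
    """Sketch of a process-level reward over a multi-step reasoning trace:
    each intermediate referred point is scored and the scores are averaged,
    so the policy is rewarded for being metrically accurate at every step,
    not only at the final answer."""
    if not step_points:
        return 0.0
    scores = [point_distance_reward(p, g, image_diag)
              for p, g in zip(step_points, gt_points)]
    return sum(scores) / len(scores)

# Example: a 2-step reasoning trace on a 640x480 image
diag = math.hypot(640, 480)
print(process_reward([(320, 200), (350, 260)], [(318, 205), (400, 300)], diag))
```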