Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models on four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on a different aspect of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D
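
To make the probing setup concrete, below is a minimal sketch of the kind of feature extraction such a study relies on: running posed RGB frames of a 3D scene through a frozen visual foundation encoder (here DINOv2, one of the evaluated models) to obtain dense per-frame features that can later be lifted into 3D. This is an illustrative assumption, not the paper's exact pipeline; the model name and tensor shapes follow the public facebookresearch/dinov2 torch.hub release.

```python
import torch

# Load a frozen DINOv2 encoder (ViT-S/14) from the official hub release.
# Any of the paper's seven evaluated encoders could stand in here.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# A batch of RGB frames from a 3D scene video, with H and W chosen as
# multiples of the 14-pixel patch size. Random data stands in for real frames.
frames = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    # forward_features returns a dict; the normalized patch tokens act as
    # dense per-frame features suitable for projection into a 3D feature map.
    feats = model.forward_features(frames)["x_norm_patchtokens"]

print(feats.shape)  # torch.Size([4, 256, 384]): 16x16 patches, 384-dim tokens
```

In a probing study of this kind, the encoder stays frozen and only lightweight task heads (e.g., for grounding or segmentation) are trained on top of the extracted features, so differences in downstream performance can be attributed to the encoders themselves.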