2 months ago

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Ding, Runyu ; Yang, Jihan ; Xue, Chuhui ; Zhang, Wenqing ; Bai, Song ; Qi, Xiaojuan
Abstract

Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and thus the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5%$\sim$65.3%), instance segmentation (e.g. 21.8%$\sim$54.0%) and panoptic segmentation (e.g. 14.7%$\sim$43.3%). Code will be available.
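The abstract says point-caption association relies on the 3D geometry between points and multi-view images, but gives no implementation details. As a rough illustration only, the sketch below shows one plausible way to decide which 3D points a single view (and therefore its caption) could be associated with: project the points through a pinhole camera and keep those landing inside the image. The function names `points_in_view` and `pooled_view_feature`, the pinhole model, and the mean pooling are all assumptions made for this example and are not taken from the paper. Occlusion is ignored.

```python
import numpy as np


def points_in_view(points_world, world_to_cam, intrinsics, image_size):
    """Return a boolean mask of points that project inside one camera image.

    Hypothetical helper, not from the paper.
    points_world: (N, 3) xyz coordinates in the world frame.
    world_to_cam: (4, 4) rigid transform from world to camera frame.
    intrinsics:   (3, 3) pinhole camera matrix K.
    image_size:   (width, height) in pixels.
    """
    n = points_world.shape[0]
    homogeneous = np.hstack([points_world, np.ones((n, 1))])
    cam = (world_to_cam @ homogeneous.T).T[:, :3]

    # Points at or behind the camera plane can never be visible.
    in_front = cam[:, 2] > 1e-6
    # Use a dummy depth for those points so the division below is safe.
    depth = np.where(in_front, cam[:, 2], 1.0)

    pixels = (intrinsics @ cam.T).T
    u = pixels[:, 0] / depth
    v = pixels[:, 1] / depth

    width, height = image_size
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return in_front & inside


def pooled_view_feature(point_features, mask):
    """Average the features of the points associated with one view.

    Hypothetical helper: mean pooling is an assumption for illustration.
    """
    if not mask.any():
        return np.zeros(point_features.shape[1])
    return point_features[mask].mean(axis=0)


# Toy example: identity camera pose, 100x100 image, focal length 100 px.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
points = np.array([[0.0, 0.0, 1.0],    # straight ahead
                   [0.0, 0.0, -1.0],   # behind the camera
                   [10.0, 0.0, 1.0],   # far outside the image bounds
                   [0.1, 0.0, 1.0]])   # slightly right of center
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [3.0, 6.0]])

mask = points_in_view(points, pose, K, (100, 100))
view_feature = pooled_view_feature(features, mask)
```

In a pipeline of this kind, a pooled feature such as `view_feature` could then be compared with the text embedding of the view's caption. The actual association levels and training objective are described in the paper, not here.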
