
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi
Abstract

Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% $\sim$ 44.7% hIoU and 14.5% $\sim$ 50.4% hAP$_{50}$ in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.
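The contrastive step the abstract describes, aligning 3D features with caption embeddings, follows the familiar CLIP-style formulation. Below is a minimal PyTorch sketch of such a symmetric InfoNCE objective; the function name, the assumption of pooled per-region 3D embeddings, and the temperature value are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_3d_text_loss(point_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss aligning 3D embeddings with caption embeddings.

    point_emb: (B, D) pooled features for B 3D regions
    text_emb:  (B, D) embeddings of the matching captions (e.g. from a
               frozen text encoder); row i of each tensor is a positive pair.
    """
    # Normalize so the dot product is a cosine similarity.
    point_emb = F.normalize(point_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs.
    logits = point_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both matching directions (3D-to-text
    # and text-to-3D), as in CLIP-style contrastive training.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a setup like the one the abstract outlines, `point_emb` would presumably be pooled from a 3D backbone over the point set that a given caption covers at one of the hierarchical (coarse-to-fine) levels, while `text_emb` would come from the pre-trained VL model's text encoder; training this loss is what gives the 3D features language-aware embeddings usable for open-vocabulary queries.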
