
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, Peng Gao
Abstract

Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and constrained to the classification task. In this paper, we combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. On the visual end, we prompt CLIP via a shape projection module that generates more realistic depth maps, narrowing the domain gap between projected point clouds and natural images. On the textual end, we prompt the GPT model to generate 3D-specific text as the input to CLIP's textual encoder. Without any training in 3D domains, our approach surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating its generalization ability for unified 3D open-world learning.
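The pipeline the abstract describes has two stages: project the point cloud into an image-like depth map, then match the depth map's CLIP feature against text features built from 3D-specific prompts. Below is a minimal NumPy sketch of both stages under simplifying assumptions: the projection is a plain orthographic z-buffer (the paper's shape projection module additionally densifies and smooths the map to resemble a natural image), and the encoders are stood in by precomputed feature vectors rather than real CLIP weights. All function names here are illustrative, not the paper's API.

```python
import numpy as np

def project_to_depth_map(points, size=32):
    """Orthographically project a point cloud (N, 3) onto the XY plane,
    keeping the nearest depth per pixel. This is a simplified stand-in
    for PointCLIP V2's shape projection module."""
    # Normalize coordinates into [0, 1] per axis.
    p = points - points.min(axis=0)
    p = p / (p.max(axis=0) + 1e-8)
    depth = np.zeros((size, size))
    xs = np.clip((p[:, 0] * (size - 1)).astype(int), 0, size - 1)
    ys = np.clip((p[:, 1] * (size - 1)).astype(int), 0, size - 1)
    for x, y, z in zip(xs, ys, p[:, 2]):
        # Nearer points (smaller z) get brighter depth values.
        depth[y, x] = max(depth[y, x], 1.0 - z)
    return depth

def zero_shot_classify(image_feat, text_feats):
    """CLIP-style zero-shot classification: cosine similarity between one
    image feature (D,) and a bank of per-class text features (C, D),
    where each text feature would come from encoding a GPT-generated
    3D-specific prompt for that class."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```

Because no 3D training is involved, swapping in a new label set only requires generating new text prompts and re-encoding them; the visual side is untouched.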
