CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding

Manual annotation of large-scale point cloud datasets for varying tasks such as 3D object classification, segmentation and detection is often laborious owing to the irregular structure of point clouds. Self-supervised learning, which operates without any human labeling, is a promising approach to address this issue. We observe in the real world that humans are capable of mapping the visual concepts learnt from 2D images to understand the 3D world. Encouraged by this insight, we propose CrossPoint, a simple cross-modal contrastive learning approach to learn transferable 3D point cloud representations. It enables a 3D-2D correspondence of objects by maximizing agreement between point clouds and the corresponding rendered 2D image in the invariant space, while encouraging invariance to transformations in the point cloud modality. Our joint training objective combines the feature correspondences within and across modalities, thus assembling a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised fashion. Experimental results show that our approach outperforms previous unsupervised learning methods on a diverse range of downstream tasks, including 3D object classification and segmentation. Further, ablation studies validate the potency of our approach for better point cloud understanding. Code and pretrained models are available at http://github.com/MohamedAfham/CrossPoint.
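The joint training objective described above can be illustrated with a minimal sketch. The PyTorch snippet below applies an NT-Xent-style contrastive loss both within the point cloud modality (between two augmented views) and across modalities (between a point cloud embedding and the rendered image embedding), then sums the two terms; the function names, temperature, prototype-by-averaging step, and unweighted sum are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def nt_xent(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss between two batches of embeddings.

    Positive pairs are (z_a[i], z_b[i]); every other sample in the
    concatenated batch serves as a negative.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    z = torch.cat([z_a, z_b], dim=0)            # (2N, D)
    sim = z @ z.t() / temperature               # (2N, 2N) scaled cosine similarities
    n = z_a.size(0)
    # Mask self-similarity so a sample is never treated as its own negative.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))
    # The positive for index i is i + n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def joint_contrastive_loss(f_pc_view1: torch.Tensor,
                           f_pc_view2: torch.Tensor,
                           f_img: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Sketch of a joint objective: intra-modal loss between two augmented
    point cloud views, plus cross-modal loss between a point cloud prototype
    and the rendered image embedding (prototype-by-averaging is an assumption).
    """
    loss_intra = nt_xent(f_pc_view1, f_pc_view2, temperature)
    f_proto = 0.5 * (f_pc_view1 + f_pc_view2)   # simple prototype of the two views
    loss_cross = nt_xent(f_proto, f_img, temperature)
    return loss_intra + loss_cross


if __name__ == "__main__":
    n, d = 8, 128
    # Random stand-ins for the point cloud and image encoder outputs.
    loss = joint_contrastive_loss(torch.randn(n, d), torch.randn(n, d), torch.randn(n, d))
    print(loss.item())
```

In practice the three embeddings would come from a point cloud encoder (applied to two augmented views of the same object) and an image encoder (applied to a rendered view), each followed by a projection head, with the summed loss backpropagated through both branches.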