Abstract

Pre-training on large-scale image data has become the de facto approach for learning robust 2D representations. In contrast, because data acquisition and annotation are expensive, the paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative that obtains superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. Through self-supervised pre-training, we leverage well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. First, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. Second, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy with a linear SVM on ModelNet40, competitive with the fully trained results of existing methods. With further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains a state-of-the-art 90.11% accuracy, +3.68% over the second-best method, demonstrating superior transfer capacity. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.
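
To make the 2D-guided masking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each point token already carries a non-negative saliency score derived from the 2D features (how that score is computed is not specified in the abstract), and it keeps a fixed fraction of tokens visible by weighted sampling without replacement, so that high-saliency tokens are more likely to reach the encoder. The function names `guided_visible_indices` and `masked_indices`, the `keep_ratio` parameter, and the sampling rule itself are all hypothetical choices made for illustration.

```python
import math
import random


def guided_visible_indices(saliency, keep_ratio=0.2, seed=0):
    """Choose which point tokens stay visible, favoring high saliency.

    Assumption: `saliency` holds one non-negative score per point token.
    Tokens are drawn without replacement with probability proportional to
    their score (Efraimidis-Spirakis keys, computed in the log domain).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if any(s < 0 for s in saliency):
        raise ValueError("saliency scores must be non-negative")

    rng = random.Random(seed)
    n_keep = max(1, round(len(saliency) * keep_ratio))

    keys = []
    for s in saliency:
        if s > 0:
            u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is finite
            keys.append(math.log(u) / s)  # larger key = more likely kept
        else:
            keys.append(float("-inf"))  # zero-saliency tokens are picked last

    order = sorted(range(len(saliency)), key=lambda i: keys[i], reverse=True)
    return sorted(order[:n_keep])


def masked_indices(num_tokens, visible):
    """Return the complement of `visible`: the tokens hidden from the encoder."""
    visible_set = set(visible)
    return [i for i in range(num_tokens) if i not in visible_set]


if __name__ == "__main__":
    # Hypothetical saliency scores for 10 point tokens.
    scores = [0.0, 5.0, 0.1, 3.0, 0.0, 2.0, 0.2, 4.0, 0.0, 1.0]
    visible = guided_visible_indices(scores, keep_ratio=0.3, seed=1)
    print("visible:", visible)
    print("masked: ", masked_indices(len(scores), visible))
```

Random masking corresponds to giving every token the same score. A deterministic top-k selection would be another possible variant; sampling is used here only so that the visible set varies across seeds while still favoring salient tokens.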
