Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning

Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for both 2D and 3D computer vision. Nevertheless, existing MAE-based methods still have certain drawbacks. First, the functional decoupling between the encoder and decoder is incomplete, which limits the encoder's representation learning ability. Second, downstream tasks use only the encoder, failing to fully leverage the knowledge acquired through the encoder-decoder architecture in the pretext task. In this paper, we propose Point Regress AutoEncoder (Point-RAE), a new scheme of regressive autoencoders for point cloud self-supervised learning. The proposed method decouples the functions of the encoder and decoder by introducing a mask regressor, which predicts the masked patch representations from the visible patch representations produced by the encoder; the decoder then reconstructs the target from the predicted masked patch representations. This design minimizes the impact of decoder updates on the encoder's representation space. Moreover, we introduce an alignment constraint to ensure that the representations of masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed by the encoder. To make full use of the knowledge learned during pre-training, we design a new fine-tuning mode for the proposed Point-RAE. Extensive experiments demonstrate that our approach is efficient during pre-training and generalizes well to various downstream tasks. Specifically, our pre-trained models achieve a high accuracy of \textbf{90.28\%} on the hardest split of ScanObjectNN and \textbf{94.1\%} accuracy on ModelNet40, surpassing all other self-supervised learning methods. Our code and pre-trained models are publicly available at: \url{https://github.com/liuyyy111/Point-RAE}.
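The data flow described above (encoder on visible patches, a mask regressor predicting masked-patch representations, a decoder reconstructing from those predictions, plus an alignment loss against the encoder's own masked-patch representations) can be sketched with a toy NumPy example. This is a minimal illustration, not the authors' implementation: the networks are stand-in random linear maps, the dimensions are arbitrary, and the mean-pooled context replacing cross-attention in the regressor is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not taken from the paper).
n_patches, n_masked, patch_dim, embed_dim = 64, 40, 96, 32

def linear(d_in, d_out):
    # A fixed random linear map standing in for a learned network.
    W = rng.normal(0.0, 0.02, (d_in, d_out))
    return lambda x: x @ W

encoder = linear(patch_dim, embed_dim)         # sees only visible patches
mask_regressor = linear(embed_dim, embed_dim)  # predicts masked-patch reps
decoder = linear(embed_dim, patch_dim)         # reconstructs point patches

patches = rng.normal(size=(n_patches, patch_dim))
mask = np.zeros(n_patches, dtype=bool)
mask[rng.choice(n_patches, n_masked, replace=False)] = True

vis_rep = encoder(patches[~mask])              # (n_visible, embed_dim)

# The mask regressor predicts one representation per masked patch from the
# visible representations; a mean-pooled context keeps the sketch simple.
context = vis_rep.mean(axis=0, keepdims=True)
pred_masked_rep = mask_regressor(np.repeat(context, n_masked, axis=0))

# The decoder reconstructs targets only from the predicted representations,
# so its updates do not flow back through the encoder's visible-patch path.
recon = decoder(pred_masked_rep)
loss_recon = np.mean((recon - patches[mask]) ** 2)

# Alignment constraint: predicted masked-patch representations should match
# the encoder's own representations of the masked patches (in a real
# implementation this target would be detached from the gradient).
target_rep = encoder(patches[mask])
loss_align = np.mean((pred_masked_rep - target_rep) ** 2)

total_loss = loss_recon + loss_align
```

In training, `total_loss` would be minimized jointly over the encoder, mask regressor, and decoder; the split into two terms mirrors the paper's separation of reconstruction from representation alignment.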