Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling

We present Point-BERT, a new paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds. Inspired by BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local point patches, and design a point cloud Tokenizer based on a discrete Variational AutoEncoder (dVAE) to generate discrete point tokens containing meaningful local information. Then, we randomly mask out some patches of the input point clouds and feed them into the backbone Transformer. The pre-training objective is to recover the original point tokens at the masked locations under the supervision of the point tokens produced by the Tokenizer. Extensive experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers. Equipped with our pre-training strategy, we show that a pure Transformer architecture attains 93.8% accuracy on ModelNet40 and 83.1% accuracy on the hardest setting of ScanObjectNN, surpassing carefully designed point cloud models with far fewer hand-crafted components. We also demonstrate that the representations learned by Point-BERT transfer well to new tasks and domains, where our models substantially advance the state of the art in few-shot point cloud classification. The code and pre-trained models are available at https://github.com/lulutang0608/Point-BERT.
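To make the MPM pipeline concrete, below is a minimal PyTorch sketch of the pre-training objective: local patches are embedded, a subset is replaced by a learnable mask token, and a classification head predicts the discrete token ids (from the frozen dVAE Tokenizer) at the masked positions. The `MPMSketch` class, its layer sizes, the linear stand-in for the mini-PointNet patch embedding, and the random masking scheme are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Illustrative sketch of Masked Point Modeling (MPM); names and shapes
# are assumptions for exposition, not Point-BERT's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPMSketch(nn.Module):
    def __init__(self, points_per_patch=32, vocab_size=8192, dim=384):
        super().__init__()
        # Stand-in for the paper's mini-PointNet patch embedding.
        self.patch_embed = nn.Linear(3 * points_per_patch, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True),
            num_layers=4)
        # Learnable embedding substituted at masked patch positions.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Head predicting discrete point-token ids over the dVAE vocabulary.
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, patches, gt_tokens, mask):
        # patches:   (B, G, P, 3) local point patches (G groups of P points)
        # gt_tokens: (B, G) token ids produced by the frozen dVAE Tokenizer
        # mask:      (B, G) boolean, True where a patch is masked out
        x = self.patch_embed(patches.flatten(2))            # (B, G, dim)
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(x), x)    # hide masked patches
        logits = self.head(self.encoder(x))                 # (B, G, vocab_size)
        # Cross-entropy only at masked positions, supervised by dVAE tokens.
        return F.cross_entropy(logits[mask], gt_tokens[mask])

# Toy usage with random data in place of real patches and dVAE tokens.
B, G, P = 2, 64, 32
patches = torch.randn(B, G, P, 3)
gt_tokens = torch.randint(0, 8192, (B, G))  # would come from the Tokenizer
mask = torch.rand(B, G) < 0.4               # mask out ~40% of the patches
loss = MPMSketch()(patches, gt_tokens, mask)
```

The key design point this sketch captures is that, unlike regressing raw point coordinates, the model is supervised with discrete tokens, so the loss is a standard classification objective over the dVAE vocabulary, directly mirroring BERT's masked language modeling.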