
Abstract

To date, various 3D scene understanding tasks still lack practical and generalizable pre-trained models, primarily due to the intricate nature of 3D scene understanding tasks and the immense variations introduced by camera views, lighting, occlusions, etc. In this paper, we tackle this challenge by introducing a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion. Inspired by how infants learn from visual data in the wild, we explore the rich spatio-temporal cues derived from the 3D data. Specifically, STRL takes two temporally correlated frames from a 3D point cloud sequence as the input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner. To corroborate the efficacy of STRL, we conduct extensive experiments on three types of datasets (synthetic, indoor, and outdoor). Experimental results demonstrate that, compared with supervised learning methods, the learned self-supervised representation enables various models to attain comparable or even better performance, and that the pre-trained models generalize to downstream tasks, including 3D shape classification, 3D object detection, and 3D semantic segmentation. Moreover, the spatio-temporal contextual cues embedded in 3D point clouds significantly improve the learned representations.
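The pipeline described above (two temporally correlated frames, spatial augmentation, an invariance objective) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the abstract does not specify the augmentations, the network, or the loss. Here `augment` uses a random z-axis rotation and uniform scaling, `toy_encoder` is a hypothetical hand-written stand-in for a learned network, and `invariance_loss` is a cosine-based distance, one common choice for invariance objectives.

```python
import math
import random


def augment(points, rng):
    """Assumed spatial augmentation: random z-axis rotation plus uniform scaling."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    scale = rng.uniform(0.8, 1.25)
    c, s = math.cos(theta), math.sin(theta)
    return [(scale * (c * x - s * y), scale * (s * x + c * y), scale * z)
            for x, y, z in points]


def toy_encoder(points):
    """Hypothetical stand-in for a learned encoder: a 3-value shape summary."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    radii = [math.dist(p, (cx, cy, cz)) for p in points]
    mean_r = sum(radii) / n
    spread_r = math.sqrt(sum((r - mean_r) ** 2 for r in radii) / n)
    mean_height = sum(abs(p[2] - cz) for p in points) / n
    return [mean_r, spread_r, mean_height]


def invariance_loss(a, b):
    """Assumed objective: 2 - 2 * cosine similarity (0 when embeddings align)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 2.0  # treat a degenerate embedding as orthogonal
    return 2.0 - 2.0 * dot / (norm_a * norm_b)


def pair_loss(frame_t, frame_t_next, rng):
    """Augment two temporally correlated frames and compare their embeddings."""
    view_a = augment(frame_t, rng)
    view_b = augment(frame_t_next, rng)
    return invariance_loss(toy_encoder(view_a), toy_encoder(view_b))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "sequence": the second frame is the first with small jitter.
    frame_t = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0, 2))
               for _ in range(256)]
    frame_t_next = [(x + rng.gauss(0, 0.01), y + rng.gauss(0, 0.01), z)
                    for x, y, z in frame_t]
    print(pair_loss(frame_t, frame_t_next, rng))
```

In an actual training setup the encoder would be a trainable point cloud network and this loss would be minimized over many frame pairs; the sketch only shows how the two views are formed and compared.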
