ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

The performance of neural networks scales with both their size and the amount of data they have been trained on. This has been demonstrated in both language modeling and image generation. However, such scaling requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement the ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward the state-of-the-art performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.