Mask3D: Mask Transformer for 3D Semantic Instance Segmentation

Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the success of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that we can leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. In our model, called Mask3D, each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches: it relies neither on (1) voting schemes that require hand-selected geometric properties (such as centers) nor on (2) geometric grouping mechanisms requiring manually tuned hyper-parameters (e.g., radii), and (3) it enables a loss that directly optimizes instance masks. Mask3D sets a new state-of-the-art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP) and ScanNet200 test (+12.4 mAP).
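
To make the query-based mask prediction described above concrete, the following is a minimal sketch, not the authors' implementation: it assumes per-point features have already been computed by some backbone (the full model uses a sparse-convolutional feature backbone and multi-scale cross-attention, omitted here), and the module and parameter names (`QueryMaskHead`, `num_queries`, `d_model`) are illustrative. It only shows the core idea of learnable instance queries refined by a Transformer decoder and turned into per-point masks via dot products with point features.

```python
# Hypothetical sketch of query-based instance mask prediction (not the official Mask3D code).
import torch
import torch.nn as nn

class QueryMaskHead(nn.Module):
    def __init__(self, num_queries=100, d_model=128, num_classes=20, num_layers=3):
        super().__init__()
        # One learnable embedding per instance query.
        self.queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        # Queries iteratively cross-attend to the point cloud features.
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for a "no object" class

    def forward(self, point_feats):
        # point_feats: (B, N, d_model) per-point features from a (sparse conv) backbone.
        B = point_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, Q, d_model)
        q = self.decoder(tgt=q, memory=point_feats)              # refined instance queries
        class_logits = self.class_head(q)                        # (B, Q, num_classes + 1)
        # Each mask is the dot product between one query and all point features,
        # so all instance masks are predicted in parallel.
        mask_logits = torch.einsum('bqd,bnd->bqn', q, point_feats)
        return class_logits, mask_logits.sigmoid()

# Usage: 2 scenes, 4096 points each, 128-dim features.
feats = torch.randn(2, 4096, 128)
cls, masks = QueryMaskHead()(feats)
print(cls.shape, masks.shape)  # torch.Size([2, 100, 21]) torch.Size([2, 100, 4096])
```

In this sketch the per-point mask logits come directly from query-feature dot products, which is what allows a loss to be defined on the instance masks themselves rather than on intermediate votes or cluster assignments.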