ODIN: A Single Model for 2D and 3D Segmentation

State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training, and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).
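
As a rough illustration of the alternating 2D within-view / 3D cross-view fusion described above, the following PyTorch sketch interleaves attention within each view (using pixel-coordinate positional encodings) with attention across all views jointly (using 3D-coordinate encodings). All names here (AlternatingFusionBlock, fourier_encode, dim, num_heads) and the specific layer layout are illustrative assumptions for exposition, not ODIN's actual implementation.

    # Minimal sketch of alternating 2D/3D fusion with coordinate-based
    # positional encodings. Illustrative only; not the authors' code.
    import torch
    import torch.nn as nn

    def fourier_encode(coords: torch.Tensor, dim: int) -> torch.Tensor:
        """Map (..., C) coordinates to (..., dim) sinusoidal encodings."""
        num_bands = dim // (2 * coords.shape[-1])
        freqs = 2.0 ** torch.arange(num_bands, device=coords.device,
                                    dtype=coords.dtype)
        angles = coords.unsqueeze(-1) * freqs                # (..., C, bands)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)
        return nn.functional.pad(enc, (0, dim - enc.shape[-1]))

    class AlternatingFusionBlock(nn.Module):
        """One within-view (2D) attention layer followed by one
        cross-view (3D) attention layer."""

        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn_2d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.attn_3d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, feats, pix_xy, world_xyz):
            # feats:     (B, V, N, D) per-view feature tokens
            # pix_xy:    (B, V, N, 2) 2D pixel coordinates of each token
            # world_xyz: (B, V, N, 3) unprojected 3D coordinates of each token
            B, V, N, D = feats.shape

            # 2D fusion: attend within each view, pixel-coordinate encodings.
            x = (feats + fourier_encode(pix_xy, D)).view(B * V, N, D)
            x = self.norm1(x + self.attn_2d(x, x, x, need_weights=False)[0])

            # 3D fusion: attend across all views jointly, 3D encodings.
            x = x.view(B, V * N, D) + fourier_encode(world_xyz.view(B, V * N, 3), D)
            x = self.norm2(x + self.attn_3d(x, x, x, need_weights=False)[0])
            return x.view(B, V, N, D)

    if __name__ == "__main__":
        block = AlternatingFusionBlock()
        feats = torch.randn(1, 4, 64, 256)        # 4 views, 64 tokens each
        pix = torch.rand(1, 4, 64, 2)
        xyz = torch.rand(1, 4, 64, 3)
        print(block(feats, pix, xyz).shape)       # torch.Size([1, 4, 64, 256])

Under this reading, running the same block on a single view with (or without) meaningful 3D coordinates would reduce it to an ordinary 2D image backbone, which is what lets one set of weights serve both 2D and 3D segmentation.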