3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

We present 3DMV, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network. In contrast to existing methods that use either geometry or RGB data as input for this task, we combine both data modalities in a joint, end-to-end network architecture. Rather than simply projecting color data into a volumetric grid and operating solely in 3D -- which would result in insufficient detail -- we first extract feature maps from associated RGB images. These features are then mapped into the volumetric feature grid of a 3D network using a differentiable backprojection layer. Since our target is 3D scanning scenarios with possibly many frames, we use a multi-view pooling approach in order to handle a varying number of RGB input views. This learned combination of RGB and geometric features with our joint 2D-3D architecture achieves significantly better results than existing baselines. For instance, our final result on the ScanNet 3D segmentation benchmark increases from 52.8\% to 75\% accuracy compared to existing volumetric architectures.
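To make the backprojection-and-pooling step concrete, the following is a minimal PyTorch sketch; it is our illustration, not the authors' code. The helper name `backproject_and_pool` and the assumption that per-view voxel-to-pixel correspondences are precomputed (e.g. from camera poses and depth) are ours; max pooling over views is one natural choice for combining a varying number of input frames.

```python
# A minimal sketch (not the authors' implementation) of backprojecting 2D
# feature maps into a voxel grid and max-pooling over views. Assumes the
# correspondences between voxels and feature-map pixels were precomputed
# per view from the known camera poses.
import torch

def backproject_and_pool(feat2d, vox_idx, pix_idx, num_voxels):
    """feat2d:  (V, C, H, W) 2D feature maps for V views
       vox_idx: list of V LongTensors, linear voxel indices hit by each view
       pix_idx: list of V LongTensors, matching linear pixel indices into H*W
       returns: (num_voxels, C) per-voxel features, max-pooled over views"""
    V, C, H, W = feat2d.shape
    flat = feat2d.view(V, C, H * W)
    # Start from -inf so voxels seen by no view can be zeroed out afterwards.
    pooled = feat2d.new_full((num_voxels, C), float('-inf'))
    for v in range(V):
        # Gather the 2D features that project into voxels for this view.
        f = flat[v, :, pix_idx[v]].t()               # (N_v, C)
        # Element-wise max over views = multi-view max pooling.
        pooled.index_reduce_(0, vox_idx[v], f, 'amax')
    pooled[pooled == float('-inf')] = 0.0            # unobserved voxels
    return pooled
```

Because the gather and max operations are differentiable (with subgradients for the max), gradients can flow from the 3D network back into the 2D feature extractor, and pooling over the view dimension keeps the output independent of the number and order of input frames.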