4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

In many robotics and VR/AR applications, 3D-videos are readily-available sources of input (a continuous sequence of depth images, or LIDAR scans). However, those 3D-videos are processed frame-by-frame either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution that encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field that enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. Also, we show that on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks, and are faster than the 3D counterpart in some cases.
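The core idea of a generalized sparse convolution can be illustrated with a minimal toy sketch (not the paper's library or its API): features are stored only at occupied integer coordinates, the kernel is a set of weight matrices indexed by coordinate offsets, and an output feature is accumulated from whichever neighbors actually exist in the sparse tensor. All names below (`generalized_sparse_conv`, the dict-based sparse-tensor representation) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def generalized_sparse_conv(features, kernel, offsets, out_coords):
    """Toy sketch of a generalized sparse convolution.

    features   : dict mapping integer coordinate tuples -> (C_in,) arrays;
                 only non-zero (occupied) sites are stored.
    kernel     : dict mapping each offset tuple -> (C_in, C_out) weight matrix.
    offsets    : list of integer offset tuples defining the kernel shape
                 (need not be a dense hypercube, e.g. a hybrid kernel).
    out_coords : coordinate tuples at which output features are computed.
    """
    out = {}
    for u in out_coords:
        acc = None
        for off in offsets:
            v = tuple(a + b for a, b in zip(u, off))
            if v in features:  # only stored inputs contribute
                contrib = features[v] @ kernel[off]
                acc = contrib if acc is None else acc + contrib
        if acc is not None:
            out[u] = acc
    return out
```

Because both the kernel offsets and the input/output coordinate sets are arbitrary, this formulation subsumes ordinary dense discrete convolutions (take all coordinates occupied and a full hypercubic offset set) as a special case; efficiency in practice comes from never touching the empty space.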