Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, performance is limited by unidentified moving objects that violate the underlying static-scene assumption in geometric image reconstruction. More significantly, due to the lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning as recent works do, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with a recent model trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.
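
For intuition, below is a minimal sketch of how a geometry consistency term and its induced mask might look, assuming PyTorch depth tensors: one depth map synthesized by warping a neighbouring frame's prediction with the predicted relative pose, and the target frame's own prediction sampled at the corresponding pixels. The function name, the epsilon, and the exact normalized-difference form are illustrative assumptions, not necessarily the paper's precise formulation.

    import torch

    def geometry_consistency(d_warp, d_interp, eps=1e-7):
        # d_warp:   depth of the target frame synthesized by warping the
        #           source frame's predicted depth with the predicted pose
        # d_interp: target frame's predicted depth, sampled at the same
        #           projected pixel locations
        diff = (d_warp - d_interp).abs() / (d_warp + d_interp + eps)  # in [0, 1)
        loss = diff.mean()   # geometry consistency loss: penalizes scale drift
        mask = 1.0 - diff    # self-discovered mask: down-weights pixels where the
                             # two depths disagree (moving objects, occlusions)
        return loss, mask

    # usage with dummy positive depth maps of shape (batch, 1, H, W)
    loss, mask = geometry_consistency(torch.rand(1, 1, 4, 4) + 0.1,
                                      torch.rand(1, 1, 4, 4) + 0.1)

Because the normalized difference is symmetric and bounded, minimizing it encourages adjacent predictions to agree on scale, which is what allows per-frame estimates to chain into a globally scale-consistent trajectory.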