Improving Self-Supervised Single View Depth Estimation by Masking Occlusion
Single view depth estimation models can be trained from video footage using a self-supervised end-to-end approach with view synthesis as the supervisory signal. This is achieved with a framework that predicts depth and camera motion, with a loss based on reconstructing a target video frame from temporally adjacent frames. In this context, occlusion relates to parts of a scene that can be observed in the target frame but not in a frame used for image reconstruction. Since the image reconstruction is based on sampling from the adjacent frame, and occluded areas by definition cannot be sampled, reconstructed occluded areas corrupt the supervisory signal. In previous work (arXiv:1806.01260) occlusion is handled based on reconstruction error: at each pixel location, only the reconstruction with the lowest error is included in the loss. The current study aims to determine whether performance improvements of depth estimation models can be gained by ignoring, during training, only those regions that are affected by occlusion.

In this work we introduce the occlusion mask, a mask that can be used during training to specifically ignore regions that cannot be reconstructed due to occlusion. The occlusion mask is based entirely on predicted depth information. We introduce two novel loss formulations which incorporate the occlusion mask. The method and implementation of arXiv:1806.01260 serve as the foundation for our modifications as well as the baseline in our experiments. We demonstrate that (i) incorporating the occlusion mask in the loss function improves the performance of single image depth prediction models on the KITTI benchmark, and that (ii) loss functions that select from reconstructions based on error are able to ignore some of the reprojection error caused by object motion.
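The per-pixel selection used in the baseline (keeping, at each pixel, only the reconstruction with the lowest error) can be made concrete with a short sketch. The following is a minimal PyTorch-style illustration, not the authors' implementation: plain L1 error stands in for the photometric error used in practice, and the names min_reprojection_loss and reconstructions are illustrative.

    import torch

    def min_reprojection_loss(target, reconstructions):
        # target: (B, 3, H, W) target frame.
        # reconstructions: list of (B, 3, H, W) tensors, one per
        # temporally adjacent source frame.
        # Per-pixel L1 error for each reconstruction, averaged over channels.
        errors = [(target - rec).abs().mean(dim=1, keepdim=True)
                  for rec in reconstructions]
        # Stack to (B, num_sources, H, W) and keep only the lowest error at
        # each pixel, so a pixel occluded in one source frame can fall back
        # to a frame in which it is visible.
        per_pixel_min, _ = torch.cat(errors, dim=1).min(dim=1)
        return per_pixel_min.mean()

    # Example call with dummy data at a typical KITTI training resolution.
    target = torch.rand(2, 3, 192, 640)
    recs = [torch.rand(2, 3, 192, 640) for _ in range(2)]
    loss = min_reprojection_loss(target, recs)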
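The abstract states only that the occlusion mask is computed entirely from predicted depth, not how. As an assumption rather than a description of the paper's method, one depth-only construction projects every target pixel into the source view using the predicted depth and the predicted relative camera pose, and marks a pixel as occluded when another target pixel projects to the same source location at a smaller depth. A sketch under that assumption follows (all names are hypothetical; scatter_reduce_ requires a recent PyTorch):

    import torch

    def occlusion_mask_from_depth(depth, K, T):
        # Hypothetical depth-only occlusion mask, not the paper's exact rule.
        # depth: (H, W) predicted depth of the target frame.
        # K: (3, 3) camera intrinsics; T: (4, 4) target-to-source pose.
        # Returns a boolean (H, W) mask, True where the target pixel is
        # expected to be visible (samplable) in the source frame.
        H, W = depth.shape
        ys, xs = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                                torch.arange(W, dtype=depth.dtype),
                                indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
        # Back-project to 3-D, move into the source camera, and project.
        cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)
        cam = torch.cat([cam, torch.ones(1, cam.shape[1],
                                         dtype=depth.dtype)], dim=0)
        src = (T @ cam)[:3]
        z = src[2].clamp(min=1e-6)
        uv = (K @ src)[:2] / z
        in_frame = ((uv[0] >= 0) & (uv[0] <= W - 1) &
                    (uv[1] >= 0) & (uv[1] <= H - 1))
        u = uv[0].round().long().clamp(0, W - 1)
        v = uv[1].round().long().clamp(0, H - 1)
        flat = v * W + u
        # Out-of-frame projections must not claim a source location.
        z_scatter = torch.where(in_frame, z,
                                torch.full_like(z, float("inf")))
        # Scatter-min: the nearest target pixel wins each source location;
        # any target pixel projecting behind it is occluded there.
        nearest = torch.full((H * W,), float("inf"), dtype=depth.dtype)
        nearest.scatter_reduce_(0, flat, z_scatter, reduce="amin")
        visible = in_frame & (z <= nearest[flat] + 1e-3)
        return visible.reshape(H, W)

A mask produced this way for each source frame could be multiplied into the corresponding per-pixel reprojection error before averaging, so that occluded regions contribute nothing to the loss. This matches the stated goal of ignoring exactly those regions that cannot be reconstructed, but the paper's two loss formulations may combine the mask with the error differently.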