Learning to Recover 3D Scene Shape from a Single Image

Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape, due to an unknown depth shift induced by the shift-invariant reconstruction losses used in mixed-data depth prediction training, and a possibly unknown camera focal length. We investigate this problem in detail and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length, which allow us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: https://git.io/Depth
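To make the second stage concrete, the sketch below (not the paper's implementation; the function name `unproject_depth` and its arguments are hypothetical) shows how a recovered depth shift and focal length turn a shift-ambiguous depth map into a 3D point cloud, assuming a pinhole camera with a centered principal point.

```python
import numpy as np

def unproject_depth(depth, shift, focal_length):
    """Unproject a depth map predicted up to an unknown shift into a
    3D point cloud, given the recovered shift and focal length.

    depth: (H, W) array, depth up to an unknown scale and shift
    shift: scalar depth shift (e.g., predicted by a point cloud encoder)
    focal_length: focal length in pixels (assumed pinhole camera)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth + shift                       # undo the unknown depth shift
    x = (u - w / 2.0) * z / focal_length    # back-project along camera rays,
    y = (v - h / 2.0) * z / focal_length    # assuming a centered principal point
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: points = unproject_depth(pred_depth, pred_shift, pred_focal)
```

The overall scale still remains unresolved, which is consistent with the abstract's claim: the framework recovers a realistic scene shape, not metric depth.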