Single Image Depth Estimation Trained via Depth from Defocus Cues

Estimating depth from a single RGB image is a fundamental task in computer vision, most directly solved using supervised deep learning. In the field of unsupervised learning of depth from a single RGB image, depth is not given explicitly. Existing work in the field receives either a stereo pair, a monocular video, or multiple views and, using losses based on structure from motion, trains a depth estimation network. In this work, we rely on depth-from-defocus cues instead of on different views. Learning is based on a novel Point Spread Function convolutional layer, which applies location-specific kernels that arise from the Circle of Confusion at each image location. We evaluate our method on data derived from five common datasets for depth estimation and light-field images, and present results that are on par with supervised methods on the KITTI and Make3D datasets and that outperform unsupervised learning approaches. Since the phenomenon of depth from defocus is not dataset specific, we hypothesize that learning based on it would overfit less to the specific content of each dataset. Our experiments show that this is indeed the case: an estimator learned on one dataset with our method provides better results on other datasets than directly supervised methods do.
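To make the central mechanism concrete, below is a minimal sketch of a depth-dependent PSF layer. It is not the paper's implementation: it assumes a thin-lens Circle-of-Confusion formula, approximates the disk PSF with Gaussians, and keeps per-pixel kernel selection differentiable by softly blending a small bank of fixed blur levels. The class name `PSFConvLayer`, all parameter values, and the Gaussian/soft-blend choices are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code). Assumes a thin-lens
# Circle-of-Confusion model and a Gaussian approximation of the disk PSF,
# applied as a differentiable per-pixel blend over fixed blur levels.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kernel(sigma: float, radius: int) -> torch.Tensor:
    """2D Gaussian kernel of size (2*radius+1)^2, normalized to sum to 1."""
    xs = torch.arange(-radius, radius + 1, dtype=torch.float32)
    g = torch.exp(-xs**2 / (2 * max(sigma, 1e-6) ** 2))
    k = torch.outer(g, g)
    return k / k.sum()


class PSFConvLayer(nn.Module):
    """Blurs an image with a location-specific PSF driven by a depth map.

    Thin-lens CoC diameter (units illustrative):
        coc(d) = aperture * focal * |d - focus_dist| / (d * (focus_dist - focal))
    Each output pixel is a softly weighted mix of the image blurred at a
    few discrete CoC levels, so gradients flow back to the depth map.
    """

    def __init__(self, focal=0.05, aperture=0.02, focus_dist=2.0,
                 coc_levels=(0.0, 1.0, 2.0, 4.0), radius=4, pixel_size=1e-4):
        super().__init__()
        self.focal, self.aperture = focal, aperture
        self.focus_dist, self.pixel_size = focus_dist, pixel_size
        self.radius = radius
        self.register_buffer("levels", torch.tensor(coc_levels))
        # One Gaussian kernel per blur level (sigma ~ half the CoC diameter).
        kernels = torch.stack([
            gaussian_kernel(max(c / 2.0, 1e-3), radius) for c in coc_levels
        ])  # shape: (L, k, k)
        self.register_buffer("kernels", kernels)

    def forward(self, image: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        """image: (B, 3, H, W); depth: (B, 1, H, W) in scene units."""
        # Per-pixel CoC diameter, converted to pixels.
        coc = (self.aperture * self.focal * (depth - self.focus_dist).abs()
               / (depth.clamp(min=1e-3) * (self.focus_dist - self.focal)))
        coc = coc / self.pixel_size
        # Soft assignment of each pixel's CoC to the fixed blur levels.
        dist = -(coc - self.levels.view(1, -1, 1, 1)) ** 2
        weights = F.softmax(dist, dim=1)  # (B, L, H, W)
        # Blur the image once per level, then blend per pixel.
        blurred = []
        for k in self.kernels:  # k: (kh, kw)
            kern = k.expand(image.shape[1], 1, *k.shape).contiguous()
            blurred.append(F.conv2d(image, kern, padding=self.radius,
                                    groups=image.shape[1]))
        blurred = torch.stack(blurred, dim=1)  # (B, L, 3, H, W)
        return (weights.unsqueeze(2) * blurred).sum(dim=1)


if __name__ == "__main__":
    layer = PSFConvLayer()
    img = torch.rand(1, 3, 64, 64)
    depth = torch.rand(1, 1, 64, 64) * 5.0 + 0.5  # depths in ~[0.5, 5.5]
    print(layer(img, depth).shape)  # torch.Size([1, 3, 64, 64])
```

In a training loop of the kind the abstract describes, a layer like this would render a refocused image from the predicted depth, and a reconstruction loss against the observed defocused image would supervise the depth network without ground-truth depth; the fixed kernel bank is one common way to keep that rendering step differentiable.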