Predicting Ground-Level Scene Layout from Aerial Imagery

We introduce a novel strategy for learning to extract semantically meaningful features from aerial imagery. Instead of manually labeling the aerial imagery, we propose to predict (noisy) semantic features automatically extracted from co-located ground imagery. Our network architecture takes an aerial image as input, extracts features using a convolutional neural network, and then applies an adaptive transformation to map these features into the ground-level perspective. We use an end-to-end learning approach to minimize the difference between the semantic segmentation extracted directly from the ground image and the semantic segmentation predicted solely based on the aerial image. We show that a model learned using this strategy, with no additional training, is already capable of rough semantic labeling of aerial imagery. Furthermore, we demonstrate that by fine-tuning this model we can achieve more accurate semantic segmentation than two baseline initialization strategies. We use our network to address the task of estimating the geolocation and geo-orientation of a ground image. Finally, we show how features extracted from an aerial image can be used to hallucinate a plausible ground-level panorama.
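
As a rough illustration of this training setup, the sketch below wires together the three pieces the abstract describes: a CNN that extracts features from the aerial image, a learned transformation that re-maps those features into a ground-level layout, and a cross-entropy objective against segmentation labels obtained from the co-located ground image. All layer sizes, the number of classes, the panorama resolution, and the grid-sampling form of the transformation are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AerialToGroundSegNet(nn.Module):
    """Sketch only: aerial CNN features -> learned warp into a
    ground-level panorama grid -> per-pixel semantic classes.
    Sizes and layers are placeholders, not the paper's model."""

    def __init__(self, num_classes=4, feat_ch=64, pano_hw=(16, 64)):
        super().__init__()
        self.pano_hw = pano_hw
        # Aerial feature extractor (stand-in for the paper's CNN backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Predict a sampling grid (one way to realize an "adaptive
        # transformation") that warps aerial features into the
        # ground-level perspective.
        self.grid_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(feat_ch * 8 * 8, pano_hw[0] * pano_hw[1] * 2),
            nn.Tanh(),  # normalized (x, y) coordinates in [-1, 1]
        )
        # Classify warped features into ground-level semantic classes.
        self.classifier = nn.Conv2d(feat_ch, num_classes, 1)

    def forward(self, aerial):
        feats = self.encoder(aerial)                        # B x C x h x w
        grid = self.grid_head(feats).view(
            aerial.size(0), *self.pano_hw, 2)               # B x H x W x 2
        warped = F.grid_sample(feats, grid, align_corners=False)
        return self.classifier(warped)                      # B x K x H x W

# End-to-end objective: match the segmentation predicted from the aerial
# image to the (noisy) labels extracted from the co-located ground image.
model = AerialToGroundSegNet()
aerial = torch.randn(2, 3, 128, 128)           # aerial image batch (dummy data)
ground_seg = torch.randint(0, 4, (2, 16, 64))  # labels from a ground-image segmenter
loss = F.cross_entropy(model(aerial), ground_seg)
loss.backward()
```

In this sketch the ground-image segmentations act purely as training targets, so no manual aerial annotation is needed; at test time only the aerial branch is run.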