Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We propose new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Code and models will be made available.
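
Illustrative note: to make the Riemannian flow matching idea concrete, the sketch below shows, under our own simplifying assumptions rather than the paper's released code, how a conditional geodesic path and its target velocity can be constructed on the unit sphere (S^2, a proxy for the Earth's surface) with NumPy. All function names (slerp, log_map, sample_uniform_sphere) are hypothetical; in practice, a model conditioned on image features would regress a predicted tangent velocity against the target u_t shown here.

import numpy as np

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors x0 and x1."""
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if omega < 1e-8:                          # nearly identical points
        return x1
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

def log_map(x, y):
    """Tangent vector at x pointing toward y along the geodesic (spherical log map)."""
    cos_omega = np.clip(np.dot(x, y), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    v = y - cos_omega * x                     # component of y orthogonal to x
    norm = np.linalg.norm(v)
    return np.zeros_like(x) if norm < 1e-8 else (omega / norm) * v

def sample_uniform_sphere(rng):
    """Uniform sample on S^2 (serves as the 'noise' endpoint of the path)."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

# Hypothetical flow-matching training target on S^2: interpolate along the
# geodesic from noise x0 to the ground-truth location x1, and take the
# geodesic velocity toward x1 as the regression target u_t.
rng = np.random.default_rng(0)
x1 = np.array([0.0, 0.0, 1.0])                # ground-truth location (unit vector)
x0 = sample_uniform_sphere(rng)               # noise sample, uniform on the sphere
t = rng.uniform()
x_t = slerp(x0, x1, t)                        # point on the conditional geodesic path
u_t = log_map(x_t, x1) / max(1.0 - t, 1e-8)   # target velocity toward the data point
print(x_t, u_t)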