GIM: Learning Generalizable Image Matcher From Internet Videos

Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which restricts the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting and then enhanced by propagating them to distant frames. The final model is trained on the propagated data with strong augmentations. We also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Applying GIM consistently improves the zero-shot performance of three state-of-the-art image matching architectures; with 50 hours of YouTube videos, relative zero-shot performance improves by 8.4%-18.1%. GIM also enables generalization to extreme cross-domain data such as Bird's Eye View (BEV) images of projected 3D point clouds (Fig. 1(c)). More importantly, our single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The video presentation is available at https://www.youtube.com/watch?v=FU_MJLD8LeY.
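
To make the label-processing steps named above concrete, the sketch below shows one plausible Python/OpenCV rendering of two of them: filtering candidate correspondences by robust fitting and propagating surviving labels from nearby to more distant frames. All names here (robust_filter, chain_matches, the toy demo) are illustrative stand-ins under our own assumptions, not the authors' released code, and RANSAC on a fundamental matrix is only one reasonable choice of robust geometric model.

import numpy as np
import cv2


def robust_filter(pts0, pts1, thresh=1.0):
    # Keep only correspondences consistent with a RANSAC-fitted
    # fundamental matrix (the "robust fitting" filter in the abstract).
    if len(pts0) < 8:
        return np.zeros(len(pts0), dtype=bool)
    _, mask = cv2.findFundamentalMat(
        np.asarray(pts0, dtype=np.float32),
        np.asarray(pts1, dtype=np.float32),
        cv2.FM_RANSAC, thresh, 0.999)
    if mask is None:
        return np.zeros(len(pts0), dtype=bool)
    return mask.ravel().astype(bool)


def chain_matches(pts_ab0, pts_ab1, pts_bc0, pts_bc1, tol=2.0):
    # Propagate labels to a more distant frame pair: link A->B and B->C
    # matches whose shared keypoint in frame B lies within `tol` pixels.
    if len(pts_bc0) == 0:
        return np.empty((0, 2)), np.empty((0, 2))
    out0, out1 = [], []
    for a, b in zip(pts_ab0, pts_ab1):
        d = np.linalg.norm(pts_bc0 - b, axis=1)
        j = int(np.argmin(d))
        if d[j] < tol:
            out0.append(a)
            out1.append(pts_bc1[j])
    return np.asarray(out0), np.asarray(out1)


if __name__ == "__main__":
    # Toy two-view demo: project synthetic 3D points into two cameras,
    # corrupt some matches, and check that robust_filter rejects them.
    rng = np.random.default_rng(0)
    K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    X = rng.uniform(-1., 1., size=(200, 3)) + np.array([0., 0., 6.])

    def project(X, R, t):
        x = (K @ (X @ R.T + t).T).T
        return x[:, :2] / x[:, 2:3]

    R, _ = cv2.Rodrigues(np.array([[0.], [0.2], [0.]]))
    pts0 = project(X, np.eye(3), np.zeros(3))
    pts1 = project(X, R, np.array([0.3, 0., 0.]))
    pts1[:20] += rng.uniform(20., 60., size=(20, 2))  # inject outliers
    keep = robust_filter(pts0, pts1)
    print(f"kept {keep.sum()} / {len(pts0)} correspondences")

In a full GIM-style loop, the labels surviving these two stages would be fed back into training with strong augmentations; that training step is architecture-specific and therefore omitted from this sketch.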