Improving Object Detection via Local-global Contrastive Learning

Visual domain gaps often impact object detection performance. Image-to-image translation can mitigate this effect, where contrastive approaches enable learning of the image-to-image mapping under unsupervised regimes. However, existing methods often fail to handle content-rich scenes with multiple object instances, which manifests in unsatisfactory detection performance. Sensitivity to such instance-level content is typically only gained through object annotations, which can be expensive to obtain. Towards addressing this issue, we present a novel image-to-image translation method that specifically targets cross-domain object detection. We formulate our approach as a contrastive learning framework with an inductive prior that optimises the appearance of object instances through spatial attention masks, implicitly delineating the scene into foreground regions associated with the target object instances and background non-object regions. Instead of relying on object annotations to explicitly account for object instances during translation, our approach learns to represent objects by contrasting local-global information. This affords investigation of an under-explored challenge: obtaining performant detection, under domain shifts, without relying on object annotations or detector model fine-tuning. We experiment with multiple cross-domain object detection settings across three challenging benchmarks and report state-of-the-art performance.

Project page: https://local-global-detection.github.io
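To make the local-global idea concrete, the following is a minimal sketch (not the paper's actual objective or released code) of an InfoNCE-style contrast in which a soft spatial attention mask weights local features against attention-pooled global embeddings. All names, tensor shapes, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_global_contrast(feats, attn_logits, tau=0.07):
    """InfoNCE-style local-global contrast with a soft foreground mask.

    feats:       (B, C, H, W) encoder feature map.
    attn_logits: (B, 1, H, W) predicted spatial attention logits.
    """
    B, C, H, W = feats.shape
    attn = torch.sigmoid(attn_logits)                        # soft fg/bg mask in [0, 1]
    # Attention-pooled global embedding per image (foreground-weighted).
    pooled = (feats * attn).flatten(2).sum(-1) / (attn.flatten(2).sum(-1) + 1e-6)
    glob = F.normalize(pooled, dim=1)                        # (B, C)
    local = F.normalize(feats.flatten(2), dim=1)             # (B, C, H*W)
    # Score each local feature against every image's global embedding;
    # the matching image is the positive, other images are negatives.
    logits = torch.einsum('bcl,kc->bkl', local, glob) / tau  # (B, B, H*W)
    targets = torch.arange(B, device=feats.device)
    targets = targets[:, None].expand(B, H * W)              # (B, H*W)
    per_loc = F.cross_entropy(logits, targets, reduction='none')  # (B, H*W)
    # Weight locations by attention so foreground (object) regions dominate.
    w = attn.flatten(2).squeeze(1)                           # (B, H*W)
    return (per_loc * w).sum() / (w.sum() + 1e-6)

# Usage with random tensors, standing in for encoder and attention outputs.
feats = torch.randn(4, 128, 32, 32)
attn_logits = torch.randn(4, 1, 32, 32)
loss = local_global_contrast(feats, attn_logits)
```

Because the mask is soft and enters both the global pooling and the per-location weighting, gradients can shape the foreground/background split end to end, without any box-level supervision, which is the general property the abstract describes.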