CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learn generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
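To make the region prompting idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): it assumes region features are pooled from a frozen CLIP image encoder, adds a learnable prompt to those features, and classifies them against CLIP text embeddings of the category names. All class and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionPromptedClassifier(nn.Module):
    """Hypothetical sketch of region prompting: a learnable prompt is added to
    pooled region features from a frozen CLIP backbone before matching them
    against CLIP text embeddings of class names."""

    def __init__(self, feat_dim: int, roi_size: int = 7, temperature: float = 0.01):
        super().__init__()
        # Learnable prompt with the same spatial shape as the pooled region
        # features; added element-wise to bridge the whole-image-to-region gap.
        self.region_prompt = nn.Parameter(torch.zeros(feat_dim, roi_size, roi_size))
        self.temperature = temperature

    def forward(self, roi_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # roi_feats:   (num_regions, feat_dim, roi_size, roi_size), e.g. pooled
        #              from the frozen CLIP image encoder via RoIAlign.
        # text_embeds: (num_classes, feat_dim), CLIP text embeddings of class names.
        prompted = roi_feats + self.region_prompt           # apply the region prompt
        region_embeds = prompted.mean(dim=(2, 3))           # pool to one vector per region
        region_embeds = F.normalize(region_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        # Cosine-similarity logits, as in CLIP zero-shot classification.
        return region_embeds @ text_embeds.t() / self.temperature


# Toy usage: 4 candidate regions, 512-dim features, 3 category names.
clf = RegionPromptedClassifier(feat_dim=512)
rois = torch.randn(4, 512, 7, 7)
texts = torch.randn(3, 512)
print(clf(rois, texts).shape)  # torch.Size([4, 3])
```

In this sketch only the prompt parameters are trained, which is consistent with the stated goal of adapting a frozen whole-image CLIP model to region-level recognition rather than fine-tuning the backbone.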