X-Pose: Detecting Any Keypoints

This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances. Trained on UniKPT, X-Pose effectively aligns text-to-keypoint and image-to-keypoint thanks to the mutual enhancement of multi-modal prompts based on cross-modality contrastive learning. Our experimental results demonstrate that X-Pose achieves notable improvements of 27.7 AP, 6.44 PCK, and 7.0 AP over state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods in each respective fair setting. More importantly, the in-the-wild test demonstrates X-Pose's strong fine-grained keypoint localization and generalization abilities across image styles, object categories, and poses, paving a new path to multi-object keypoint detection in real applications. Our code and dataset are available at https://github.com/IDEA-Research/X-Pose.
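For intuition, the cross-modality contrastive learning mentioned above can be pictured as an InfoNCE-style objective that pulls each keypoint feature toward the text embedding of its keypoint name and pushes it away from the others. The sketch below is a minimal illustration under assumed details, not the paper's actual implementation: the function name `keypoint_text_contrastive_loss`, the tensor shapes, the row-wise positive pairing, and the symmetric cross-entropy formulation are all assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def keypoint_text_contrastive_loss(kpt_feats, txt_feats, temperature=0.07):
    """Illustrative InfoNCE-style alignment of keypoint and text features.

    kpt_feats: (N, D) features of N keypoint queries from the image branch.
    txt_feats: (N, D) embeddings of the matching keypoint names, in the same
               order, so row i of each tensor forms a positive pair (assumed).
    """
    kpt = F.normalize(kpt_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = kpt @ txt.t() / temperature                # (N, N) similarities
    targets = torch.arange(len(kpt), device=kpt.device) # diagonal = positives
    # Symmetric cross-entropy: keypoint-to-text and text-to-keypoint.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```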