8 months ago

Abstract

Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, which eliminate the need for support keypoints. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method takes only a query image and detailed text descriptions as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as the choice of keypoint descriptions, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
