
Abstract

While Multi-Object Tracking (MOT) has made substantial advancements, it relies heavily on prior knowledge and is limited to predefined categories. In contrast, Generic Multiple Object Tracking (GMOT), which tracks multiple objects with similar appearance, requires less prior information about the targets but faces challenges from variations in viewpoint, lighting, occlusion, and resolution. Our contributions commence with the introduction of the Refer-GMOT dataset, a collection of videos, each accompanied by fine-grained textual descriptions of their attributes. Subsequently, we introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT, which can track never-seen object categories with zero training examples. Within the TP-GMOT framework, we introduce two novel components: (i) TP-OD, object detection by textual prompt, for accurately detecting unseen objects with specific characteristics; and (ii) Motion-Appearance Cost SORT (MAC-SORT), a novel object association approach that integrates motion-based and appearance-based matching strategies to tackle the complex task of tracking multiple generic objects with high similarity. Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task. Additionally, to assess the generalizability of the proposed TP-GMOT framework and the effectiveness of the MAC-SORT tracker, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models will be publicly available at: https://fsoft-aic.github.io/TP-GMOT
