Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single-frame predictions for the segmentation mask itself. We propose the Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from the past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both the YouTube-VIS and BDD100K datasets, and generalizes to both one-stage and two-stage segmentation frameworks. Code and video resources are available at http://vis.xyz/pub/pcan.
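The two core steps described above — condensing a space-time memory into a small set of prototypes and then attending over those prototypes instead of every memory pixel — can be sketched in a few lines. This is a hypothetical, heavily simplified NumPy illustration, not the paper's implementation: the prototype update here is a generic EM-style soft k-means, and all names (`distill_prototypes`, `prototypical_cross_attention`), dimensions, and the number of prototypes are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_prototypes(memory, k=8, iters=5, seed=0):
    """Condense N memory features (N x D) into k prototypes with a few
    EM-style soft-assignment updates (a generic stand-in for the paper's
    prototype generation)."""
    rng = np.random.default_rng(seed)
    protos = memory[rng.choice(len(memory), size=k, replace=False)]
    for _ in range(iters):
        # E-step: soft assignment of each memory feature to each prototype
        assign = softmax(memory @ protos.T, axis=1)            # N x k
        # M-step: re-estimate prototypes as assignment-weighted means
        protos = (assign.T @ memory) / (assign.sum(0)[:, None] + 1e-8)
    return protos

def prototypical_cross_attention(query, protos):
    """Attend from per-pixel queries (M x D) to the k prototypes rather
    than the full space-time memory, cutting cost from O(M*N) to O(M*k)."""
    attn = softmax(query @ protos.T, axis=1)                   # M x k
    return attn @ protos                                       # M x D

# Toy usage: 500 memory features and 100 query pixels, 64-dim each.
mem = np.random.default_rng(1).normal(size=(500, 64))
q = np.random.default_rng(2).normal(size=(100, 64))
protos = distill_prototypes(mem, k=8)
out = prototypical_cross_attention(q, protos)
print(out.shape)  # (100, 64)
```

The point of the reduction is that attention cost now scales with the number of prototypes rather than with the full memory size, which is what makes reading from many past frames tractable online.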