Local All-Pair Correspondence for Point Tracking

We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches to this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image. Such maps often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.
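To make the distinction between local 2D and local 4D correlation concrete, the following is a minimal NumPy sketch, not the LocoTrack implementation. It assumes dense feature maps of shape (H, W, C), integer point coordinates, and a square neighborhood of a given radius. The function names (`crop_patch`, `local_2d_correlation`, `local_4d_correlation`) and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def crop_patch(feat, center, radius):
    """Crop a (2r+1, 2r+1, C) patch around an integer (row, col) center.

    Edge padding is an assumption made here so that patches near the
    image border keep a fixed size.
    """
    r = radius
    padded = np.pad(feat, ((r, r), (r, r), (0, 0)), mode="edge")
    y, x = center
    # After padding by r, original index y sits at padded index y + r,
    # so this slice covers original rows y - r .. y + r (same for columns).
    return padded[y:y + 2 * r + 1, x:x + 2 * r + 1]


def local_2d_correlation(query_feat, target_feat, query_pt, target_pt, radius):
    """Correlate ONE query feature vector with a local target region."""
    q_vec = query_feat[query_pt[0], query_pt[1]]          # (C,)
    t_patch = crop_patch(target_feat, target_pt, radius)  # (P, P, C)
    return np.einsum("c,klc->kl", q_vec, t_patch)         # (P, P)


def local_4d_correlation(query_feat, target_feat, query_pt, target_pt, radius):
    """Correlate EVERY position in a local query region with every
    position in a local target region (all-pair correspondences)."""
    q_patch = crop_patch(query_feat, query_pt, radius)    # (P, P, C)
    t_patch = crop_patch(target_feat, target_pt, radius)  # (P, P, C)
    return np.einsum("ijc,klc->ijkl", q_patch, t_patch)   # (P, P, P, P)


# Toy inputs: random features stand in for a learned feature extractor.
rng = np.random.default_rng(0)
query_feat = rng.standard_normal((16, 16, 8))
target_feat = rng.standard_normal((16, 16, 8))
query_pt, target_pt, radius = (5, 7), (6, 9), 3

corr_2d = local_2d_correlation(query_feat, target_feat, query_pt, target_pt, radius)
corr_4d = local_4d_correlation(query_feat, target_feat, query_pt, target_pt, radius)
```

In this sketch the 2D map is exactly the central slice `corr_4d[radius, radius]` of the 4D volume. The remaining slices record how the neighbors of the query point match the target region, which is the additional evidence that can help disambiguate homogeneous or repetitive areas. Sub-pixel sampling, bidirectional matching, and the correlation encoder are deliberately left out.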