SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT_ext and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments. Code and results are available at https://github.com/yangchris11/samurai.
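To make the described mechanism concrete, the sketch below illustrates one plausible form of motion-aware selection: a constant-velocity Kalman filter predicts the target box, candidate masks (e.g., SAM 2's multi-mask outputs) are re-scored by blending their affinity scores with the IoU against that prediction, and only high-quality past frames are kept as memory rather than a fixed recent window. This is a minimal sketch under stated assumptions; the names (ConstantVelocityKF, select_mask, motion_aware_memory), the weight alpha, and the thresholds are illustrative, not the authors' exact implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class ConstantVelocityKF:
    """Constant-velocity Kalman filter over box state [x, y, w, h, vx, vy, vw, vh]."""
    def __init__(self, box):
        x1, y1, x2, y2 = box
        self.x = np.array([x1, y1, x2 - x1, y2 - y1, 0, 0, 0, 0], dtype=float)
        self.P = np.eye(8) * 10.0                        # state covariance
        self.F = np.eye(8); self.F[:4, 4:] = np.eye(4)   # position += velocity
        self.H = np.eye(4, 8)                            # observe position/size only
        self.Q = np.eye(8) * 1e-2                        # process noise (assumed)
        self.R = np.eye(4) * 1.0                         # measurement noise (assumed)

    def predict(self):
        """Propagate the state one frame and return the predicted box."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        x, y, w, h = self.x[:4]
        return (x, y, x + w, y + h)

    def update(self, box):
        """Correct the state with the selected box measurement."""
        x1, y1, x2, y2 = box
        z = np.array([x1, y1, x2 - x1, y2 - y1], dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(8) - K @ self.H) @ self.P

def select_mask(kf, candidates, alpha=0.25):
    """Pick the candidate whose blended motion + mask-affinity score is highest.

    `candidates` is a list of (box, affinity) pairs; `alpha` weights the
    motion score (value chosen for illustration only).
    """
    predicted = kf.predict()
    scored = [(alpha * iou(box, predicted) + (1 - alpha) * aff, box)
              for box, aff in candidates]
    best_score, best_box = max(scored, key=lambda t: t[0])
    kf.update(best_box)  # refine the filter with the selected box
    return best_box, best_score

def motion_aware_memory(history, max_frames=7, aff_thresh=0.5, motion_thresh=0.25):
    """Keep only high-quality past frames as memory instead of a fixed window.

    Each entry in `history` is assumed to be a dict with per-frame
    'affinity' and 'motion' scores (hypothetical structure).
    """
    good = [f for f in history
            if f["affinity"] > aff_thresh and f["motion"] > motion_thresh]
    return good[-max_frames:]
```

In this reading, the filter is training-free and sits entirely outside the segmentation model, which is consistent with the abstract's claim that no retraining or fine-tuning is required.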