Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

We present the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. Trained separately, the interaction module converts user interactions to an object mask, which is then temporally propagated by our propagation module using a novel top-$k$ filtering strategy in reading the space-time memory. To effectively take the user's intent into account, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction, which are aligned with the target frames by employing the space-time memory. We evaluate our method both qualitatively and quantitatively with different forms of user interactions (e.g., scribbles, clicks) on DAVIS to show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interactions. We contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames, along with our source code, to facilitate future research.
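To make the top-$k$ filtering idea concrete: when reading a space-time memory, each query location attends to all memory locations, but only the $k$ largest affinities are kept before the softmax, suppressing noisy low-affinity matches. The sketch below is a minimal NumPy illustration of this reading step under assumed tensor shapes; the function name, shapes, and flat (non-spatial) layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def topk_memory_read(mem_keys, mem_values, query_keys, k=50):
    """Read a space-time memory with top-k filtered softmax attention.

    mem_keys:   (C, N) key features for N memory locations (assumed layout)
    mem_values: (D, N) value features for the same locations
    query_keys: (C, M) key features for M query locations
    k:          number of memory entries retained per query location
    """
    affinity = mem_keys.T @ query_keys                       # (N, M) dot-product scores
    k = min(k, affinity.shape[0])
    # Keep only the k largest affinities per query column; discard the rest.
    idx = np.argpartition(-affinity, k - 1, axis=0)[:k]      # (k, M) indices of top-k
    top = np.take_along_axis(affinity, idx, axis=0)          # (k, M) top-k scores
    # Softmax over the retained entries only.
    weights = np.exp(top - top.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum of the corresponding memory values per query location.
    out = np.zeros((mem_values.shape[0], query_keys.shape[1]))
    for m in range(query_keys.shape[1]):
        out[:, m] = mem_values[:, idx[:, m]] @ weights[:, m]
    return out
```

With `k` equal to the full memory size, this reduces to the standard (unfiltered) space-time memory read; smaller `k` keeps only the most relevant memory entries per query.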