Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

Unsupervised video object segmentation aims to detect the most salient object in a video without any external guidance regarding the object. Salient objects often exhibit distinctive movements compared to the background, and recent methods leverage this by combining motion cues from optical flow maps with appearance cues from RGB images. However, because optical flow maps are often closely correlated with segmentation masks, networks can become overly dependent on motion cues during training, leading to vulnerability when faced with confusing motion cues and resulting in unstable predictions. To address this challenge, we propose a novel motion-as-option network that treats motion cues as an optional component rather than a necessity. During training, we randomly input RGB images into the motion encoder instead of optical flow maps, which implicitly reduces the network's reliance on motion cues. This design ensures that the motion encoder is capable of processing both RGB images and optical flow maps, leading to two distinct predictions depending on the type of input provided. To make the most of this flexibility, we introduce an adaptive output selection algorithm that determines the optimal prediction during testing.
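The abstract describes two mechanisms: randomly feeding RGB images into the motion encoder during training, and selecting between the flow-based and RGB-based predictions at test time. The following is a minimal PyTorch-style sketch of both ideas, under stated assumptions: the module names (MotionAsOptionNet, AppearanceEncoder-style components) and the swap_prob parameter are hypothetical; optical flow maps are assumed to be rendered as 3-channel images so a single encoder can accept either input type; and the confidence heuristic in select_output is an illustrative stand-in, since the abstract does not specify the actual selection criterion.

import random
import torch
import torch.nn as nn

class MotionAsOptionNet(nn.Module):
    """Sketch of a motion-as-option network (illustrative, not the authors'
    exact implementation). Assumes flow maps are rendered as 3-channel
    images so the motion encoder can consume either flow or RGB input."""

    def __init__(self, app_encoder: nn.Module, motion_encoder: nn.Module,
                 decoder: nn.Module, swap_prob: float = 0.5):
        super().__init__()
        self.app_encoder = app_encoder        # encodes RGB appearance cues
        self.motion_encoder = motion_encoder  # encodes flow OR RGB input
        self.decoder = decoder                # fuses both streams into a mask
        self.swap_prob = swap_prob            # chance of feeding RGB as "motion"

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        app_feat = self.app_encoder(rgb)
        if self.training and random.random() < self.swap_prob:
            # Randomly replace the optical flow map with the RGB image so the
            # network cannot become overly dependent on motion cues.
            motion_feat = self.motion_encoder(rgb)
        else:
            motion_feat = self.motion_encoder(flow)
        return self.decoder(app_feat, motion_feat)

@torch.no_grad()
def select_output(model: MotionAsOptionNet, rgb: torch.Tensor,
                  flow: torch.Tensor) -> torch.Tensor:
    """Adaptive output selection at test time (hypothetical criterion):
    run the motion encoder on both input types and keep the prediction
    whose mask is more confident, i.e., farther from the 0.5 boundary."""
    model.eval()
    pred_flow = model(rgb, flow)  # motion encoder sees the flow map
    pred_rgb = model(rgb, rgb)    # motion encoder sees the RGB image
    conf_flow = (torch.sigmoid(pred_flow) - 0.5).abs().mean()
    conf_rgb = (torch.sigmoid(pred_rgb) - 0.5).abs().mean()
    return pred_flow if conf_flow >= conf_rgb else pred_rgb

Because the motion encoder sees RGB inputs during some training steps, both calls in select_output produce sensible masks at test time; the selection step then exploits this to fall back on appearance-only evidence when the flow map is confusing (e.g., a static salient object).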