
Abstract

Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
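To make the idea of joint image and text masking concrete, the following is a minimal sketch of random masking applied to an image grid and a token sequence. It is an illustration only and not the MaskRIS implementation: the function names, the patch size, the mask ratios, the zero fill value, and the `[MASK]` placeholder are all assumptions chosen for the example, and the Distortion-aware Contextual Learning step is not modeled here.

```python
import random


def mask_image_patches(image, patch_size, mask_ratio, rng):
    """Return a copy of `image` with a random subset of patches zeroed out.

    `image` is a 2D list (height x width) of pixel values. The patch grid,
    the mask ratio, and the zero fill value are assumptions for this sketch.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    patches = [(r, c) for r in range(rows) for c in range(cols)]
    num_masked = int(round(mask_ratio * len(patches)))

    masked = [row[:] for row in image]  # copy so the input stays untouched
    for r, c in rng.sample(patches, num_masked):
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                masked[y][x] = 0
    return masked


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Return a copy of `tokens` with a random subset replaced by `mask_token`.

    The placeholder string is hypothetical; a real tokenizer defines its own.
    """
    num_masked = int(round(mask_ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    image = [[1] * 4 for _ in range(4)]
    tokens = "the man in the red shirt".split()
    print(mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=rng))
    print(mask_text_tokens(tokens, mask_ratio=0.5, rng=rng))
```

In a training loop, a model would see such masked inputs alongside the original ground-truth segmentation mask, which is what encourages robustness to occluded regions and incomplete descriptions.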
