Locate then Segment: A Strong Pipeline for Referring Image Segmentation

Referring image segmentation aims to segment the objects referred to by a natural language expression. Previous methods usually focus on designing an implicit and recurrent feature interaction mechanism to fuse the visual-linguistic features and directly generate the final segmentation mask, without explicitly modeling the localization information of the referent instances. To tackle these problems, we view this task from another perspective by decoupling it into a "Locate-Then-Segment" (LTS) scheme. Given a language expression, people generally first attend to the corresponding target image regions and then generate a fine segmentation mask for the object based on its context. LTS first extracts and fuses visual and textual features to obtain a cross-modal representation, then applies a cross-modal interaction on the visual-textual features to locate the referred object with a position prior, and finally generates the segmentation result with a lightweight segmentation network. Our LTS is simple but surprisingly effective. On three popular benchmark datasets, LTS outperforms all previous state-of-the-art methods by a large margin (e.g., +3.2% on RefCOCO+ and +3.4% on RefCOCOg). In addition, our model is more interpretable because it explicitly locates the object, which is also confirmed by visualization experiments. We believe this framework is promising to serve as a strong baseline for referring image segmentation.
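
For concreteness, the sketch below shows one way a locate-then-segment pipeline of this kind could be wired up in PyTorch: fuse visual and textual features, predict a coarse position prior, then refine it with a lightweight segmentation head. The module names, layer sizes, and embedding dimensions are assumptions for illustration only and do not reflect the authors' actual architecture.

```python
import torch
import torch.nn as nn


class LocateThenSegment(nn.Module):
    """Illustrative sketch of a locate-then-segment style pipeline.

    All modules and dimensions here are placeholders, not the paper's design.
    """

    def __init__(self, vis_dim=256, txt_dim=256):
        super().__init__()
        # Stand-ins for a visual backbone and a language encoder.
        self.vis_proj = nn.Conv2d(3, vis_dim, kernel_size=7, stride=4, padding=3)
        self.txt_proj = nn.Linear(300, txt_dim)  # e.g., from pooled word embeddings
        # Step 1: fuse visual and textual features into a cross-modal map.
        self.fuse = nn.Conv2d(vis_dim + txt_dim, vis_dim, kernel_size=1)
        # Step 2: "locate" head predicting a coarse position prior (heatmap).
        self.locate = nn.Conv2d(vis_dim, 1, kernel_size=1)
        # Step 3: lightweight segmentation head that refines the mask using the
        # cross-modal features together with the position prior.
        self.segment = nn.Sequential(
            nn.Conv2d(vis_dim + 1, vis_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(vis_dim, 1, kernel_size=1),
        )

    def forward(self, image, text_emb):
        # image: (B, 3, H, W); text_emb: (B, 300) pooled expression embedding.
        v = self.vis_proj(image)                       # (B, C, h, w)
        t = self.txt_proj(text_emb)                    # (B, C)
        t = t[:, :, None, None].expand(-1, -1, *v.shape[2:])
        x = self.fuse(torch.cat([v, t], dim=1))        # cross-modal features
        prior = torch.sigmoid(self.locate(x))          # position prior heatmap
        mask = self.segment(torch.cat([x, prior], 1))  # refined mask logits
        return prior, mask


if __name__ == "__main__":
    model = LocateThenSegment()
    img = torch.randn(2, 3, 224, 224)
    txt = torch.randn(2, 300)
    prior, mask = model(img, txt)
    print(prior.shape, mask.shape)  # coarse position prior and mask logits
```

In this toy setup the position prior is fed to the segmentation head as an extra channel, which mirrors the idea of explicitly localizing the referent before producing the fine mask; the real model would use stronger visual and language encoders and a richer cross-modal interaction.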