MAttNet: Modular Attention Network for Referring Expression Comprehension

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.
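The scoring mechanism described above (language-predicted module weights combining per-module scores into one overall score) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding size, the linear projection `W`, and the scalar module scores are all assumptions made for the example.

```python
import numpy as np

def module_weights(lang_feat, W):
    """Hypothetical language-based attention: a softmax over three logits
    (subject, location, relationship) predicted from the expression embedding."""
    logits = lang_feat @ W                      # shape (3,)
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return e / e.sum()                          # weights sum to 1

def overall_score(weights, subj, loc, rel):
    """Dynamically weighted sum of the three per-module matching scores."""
    return weights[0] * subj + weights[1] * loc + weights[2] * rel

# Toy usage: one expression embedding scored against one candidate region.
rng = np.random.default_rng(0)
lang_feat = rng.standard_normal(16)             # assumed expression embedding
W = rng.standard_normal((16, 3))                # assumed learned projection
w = module_weights(lang_feat, W)
score = overall_score(w, subj=0.8, loc=0.3, rel=0.5)
```

Because the weights form a convex combination, the overall score always lies between the lowest and highest module score; an expression like "the red shirt" would push weight onto the subject module, while "the man on the left" would shift it toward location.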