LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types, and their corresponding visual regions. The GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between the coarse-grained referring expressions commonly used in similar tasks (e.g., phrase localization and referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains the optimal MNER performance and eliminates the need to employ object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expressions and the Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG), enabling RiVEG to effortlessly inherit the Visual Entailment and Visual Grounding capabilities of any current or future multimodal pretraining model. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset, achieving absolute leads of 10.65%, 6.21%, and 8.83% on the three subtasks.
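
To make the reformulation concrete, the sketch below outlines the joint MNER-VE-VG pipeline in Python. It is a minimal illustrative reading of the abstract only: every function name (run_mner, expand_with_llm, visual_entailment, visual_grounding) is a hypothetical placeholder, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the MNER-VE-VG pipeline described above.
# The concrete MNER, LLM, VE, and VG components are passed in as callables,
# mirroring the claim that RiVEG can inherit any multimodal pretraining model.
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, List

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class GroundedEntity:
    span: str               # entity surface form extracted from the text
    etype: str              # predicted entity type (e.g., PER, LOC, ORG)
    region: Optional[BBox]  # None => entity judged ungroundable

def riveg_pipeline(
    text: str,
    image: object,                                      # any image handle
    run_mner: Callable[[str], List[Tuple[str, str]]],   # text -> (span, type) pairs
    expand_with_llm: Callable[[str, str], str],         # (span, type) -> referring expression
    visual_entailment: Callable[[object, str], bool],   # does the image entail the expression?
    visual_grounding: Callable[[object, str], BBox],    # expression -> bounding box
) -> List[GroundedEntity]:
    results: List[GroundedEntity] = []
    for span, etype in run_mner(text):
        # 1) LLM bridge: rewrite the fine-grained named entity into the
        #    coarse-grained referring expression that VE/VG models expect.
        expression = expand_with_llm(span, etype)
        # 2) VE module: filter ungroundable entities, reflecting the weak
        #    image-text correlation in social media posts.
        if visual_entailment(image, expression):
            # 3) VG module: localize the expression to a visual region.
            region = visual_grounding(image, expression)
        else:
            region = None
        results.append(GroundedEntity(span, etype, region))
    return results
```

Under this decomposition, the MNER stage runs purely on text (so its performance is unaffected by grounding), and the VE gate is what turns generic Visual Grounding into Entity Grounding by deciding which entities have any region at all.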