Visual Grounding
Visual grounding (VG) aims to locate the object or region in an image that is most relevant to a natural language query. The core challenges of this task are identifying what the query refers to, understanding the content of the image, and accurately localizing the target object. Visual grounding makes human-computer interaction more natural and accurate, and it has significant application value in areas such as image annotation, content retrieval, and scene understanding.
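To make the task's input and output concrete, the following is a minimal sketch rather than any particular VG method. It assumes that the query and a set of candidate regions have already been embedded into a shared vector space by some model, and it simply returns the region whose embedding is most similar to the query. The names `Region`, `cosine`, and `ground`, as well as the toy embedding values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass
class Region:
    """A candidate image region (hypothetical structure for illustration)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    embedding: Sequence[float]              # assumed region feature vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground(query_embedding: Sequence[float], regions: List[Region]) -> Region:
    """Return the candidate region most similar to the query embedding."""
    if not regions:
        raise ValueError("at least one candidate region is required")
    return max(regions, key=lambda r: cosine(query_embedding, r.embedding))


# Toy example: two candidate regions and one query (all values are made up).
regions = [
    Region(box=(10, 20, 110, 220), embedding=[0.9, 0.1, 0.0]),
    Region(box=(150, 40, 300, 200), embedding=[0.1, 0.8, 0.3]),
]
query = [0.2, 0.9, 0.2]

best = ground(query, regions)
print(best.box)
```

In this sketch the three challenges named above map onto separate steps: embedding the query (what it refers to), embedding the regions (image content), and selecting a box (localization). Practical systems may instead regress box coordinates directly without a fixed candidate set.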