Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

Zeng, Yan ; Zhang, Xinsong ; Li, Hang
Abstract

Most existing methods in vision language pre-training rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. It is challenging for these methods to learn relations among multiple objects. To this end, we propose a new method called X-VLM to perform "multi-grained vision language pre-training." The key to learning multi-grained alignments is to locate visual concepts in the image given the associated texts, and in the meantime align the texts with the visual concepts, where the alignments are in multi-granularity. Experimental results show that X-VLM effectively leverages the learned multi-grained alignments to many downstream vision language tasks and consistently outperforms state-of-the-art methods.
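
The two ideas named in the abstract, locating a visual concept given its text and aligning the text with that concept, can be illustrated with a toy sketch. This is not the X-VLM implementation: the abstract does not specify the loss functions, so the symmetric contrastive loss and the L1 box loss below are assumptions chosen for illustration, and the names `contrastive_loss`, `box_l1_loss`, and `cosine` are hypothetical. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_vecs, concept_vecs, temperature=0.07):
    """Symmetric contrastive loss over a batch (assumed form, not from the paper).

    text_vecs[i] is assumed to describe concept_vecs[i]; a concept may be an
    object, a region, or the whole image, which is where the multiple
    granularities would enter.
    """
    n = len(text_vecs)
    sims = [[cosine(t, c) / temperature for c in concept_vecs] for t in text_vecs]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the target logit.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_sum - logits[target]

    # Text-to-concept direction uses rows; concept-to-text uses columns.
    t2c = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    c2t = sum(cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n)) / n
    return (t2c + c2t) / 2


def box_l1_loss(pred_box, true_box):
    """Mean absolute error between two (cx, cy, w, h) boxes in [0, 1]."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(pred_box)
```

In such a setup, the localization term would penalize a predicted box that misses the concept the text refers to, while the contrastive term would pull each text toward its own concept embedding and away from the others in the batch.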
