
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang
Abstract

Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at https://github.com/zzxslp/SoM-LLaVA.
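To make the "list items one by one" idea concrete, the sketch below shows how such a training sample might be assembled: a tagged image is paired with an instruction to enumerate every tag in ascending order of its index. The `Tag` dataclass, `build_listing_sample` helper, file name, and conversation format are illustrative assumptions, not the authors' released pipeline; see the repository linked above for the actual data.

```python
# A minimal sketch (assumed, not the authors' code) of constructing a
# "list items one by one" instruction-tuning sample for a tagged image.

from dataclasses import dataclass


@dataclass
class Tag:
    index: int        # alphanumeric tag drawn on the image, e.g. 1, 2, 3
    name: str         # object category of the tagged region
    description: str  # short description of the tagged region


def build_listing_sample(image_path: str, tags: list[Tag]) -> dict:
    """Pair a tagged image with an instruction asking the model to
    enumerate all visual tags in the order of their indices."""
    ordered = sorted(tags, key=lambda t: t.index)
    answer = "\n".join(
        f"{t.index}. {t.name}: {t.description}" for t in ordered
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": "<image>\nList the items tagged in the image one by one, "
                      "following the order of the numeric tags."},
            {"from": "gpt", "value": answer},
        ],
    }


# Example usage with made-up annotations:
sample = build_listing_sample(
    "kitchen_with_tags.jpg",
    [Tag(2, "mug", "a blue ceramic mug on the counter"),
     Tag(1, "kettle", "a stainless-steel kettle on the stove")],
)
print(sample["conversations"][1]["value"])
```

Sorting by tag index before rendering the answer is what ties each text token to its visual mark; during instruction tuning, samples like this would be mixed with standard visual instruction data, as described in the abstract.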
