Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.