Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.