
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Abstract

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
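The key idea above is that one interleaved image-text sequence can serve as a unifying template: a video becomes an ordered run of frames, a 3D scene a set of views, and a single image a grid of patches. As a rough illustration only, the sketch below builds one hypothetical training sample in a LLaVA-style conversation JSON layout, where each `<image>` placeholder in the text maps positionally to an entry in the image list. The field names, paths, and structure are assumptions for illustration, not the exact M4-Instruct schema.

```python
import json

# Hypothetical interleaved sample in a LLaVA-style conversation format
# (assumed layout, not the verified M4-Instruct schema). Each "<image>"
# token in the human turn maps, in order, to one path in the "images"
# list; video frames or 3D views would occupy the same slots.
sample = {
    "id": "demo-0001",
    "images": [                      # two views of the same scene (assumed paths)
        "scenes/room/view_front.jpg",
        "scenes/room/view_side.jpg",
    ],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<image>\nThese are two views of the same room. "
                     "What object is visible in both views?",
        },
        {
            "from": "gpt",
            "value": "A wooden desk appears in both the front and side views.",
        },
    ],
}

# Sanity check: the number of <image> placeholders should match the image list.
n_tokens = sample["conversations"][0]["value"].count("<image>")
assert n_tokens == len(sample["images"])

print(json.dumps(sample, indent=2))
```

Under this reading, the same loader and template handle multi-image, multi-frame, multi-view, and multi-patch data, which is what lets capabilities transfer across settings.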
