
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
Abstract

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them, together with textual instructions, into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs have typically focused on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of the LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of the vision tokens fed to the LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on GPU hardware with 24 GB of memory.
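The pipeline the abstract describes (pre-fuse visual information into the text tokens, then collapse the vision tokens passed to the LLM backbone down to one) can be sketched minimally as below. This is an illustrative NumPy sketch, not the paper's architecture: the single cross-attention step, mean-pooling compression, and all shapes (64-dim tokens, 16 instruction tokens) are assumptions standing in for the actual pre-fusion and compression modules.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefuse_and_compress(vision_tokens, text_tokens):
    """Sketch of the idea: fuse visual information into text tokens via one
    cross-attention step (modality pre-fusion), then collapse all vision
    tokens into a single token (here: mean pooling, an illustrative stand-in
    for the paper's learned compression)."""
    d = text_tokens.shape[-1]
    # Cross-attention: each text token attends over all vision tokens.
    scores = text_tokens @ vision_tokens.T / np.sqrt(d)          # (T, V)
    fused_text = text_tokens + softmax(scores) @ vision_tokens   # (T, d)
    # Extreme compression: V vision tokens -> 1 vision token.
    one_vision_token = vision_tokens.mean(axis=0, keepdims=True)  # (1, d)
    # Context fed to the LLM backbone: 1 vision token + T fused text tokens.
    return np.concatenate([one_vision_token, fused_text], axis=0)

vision = np.random.randn(576, 64)  # e.g. 576 patch tokens from the encoder
text = np.random.randn(16, 64)     # 16 instruction tokens
context = prefuse_and_compress(vision, text)
print(context.shape)  # (17, 64): the LLM sees 1 vision token instead of 576
```

The point of the sketch is the cost model: because the fusion happens before the LLM backbone, the backbone's context length drops from 576 + T to 1 + T tokens, which is where the reported FLOPs and latency savings come from.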
