MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instructions by converting visual input to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them rely on unimodal information (i.e., vision or text alone) for pruning and ignore the inherently multimodal nature of vision-language tasks. Moreover, a generic criterion that can be applied to different modalities is lacking. To mitigate this limitation, we propose to leverage both vision and text tokens to select informative vision tokens under the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Then, a subset of vision tokens is optimized to cover both the text tokens and the original set of vision tokens simultaneously. Finally, a VLM agent can be adopted to further improve the quality of the text tokens that guide vision pruning. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison shows that vision and text information are complementary, and that combining multimodal information surpasses unimodal baselines by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
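To make the coverage idea concrete, below is a minimal sketch of a greedy maximum coverage selector over vision tokens. It is not MMTok's exact formulation: the function name `greedy_coverage_select`, the cosine-similarity coverage score, and the text/vision weighting `alpha` are illustrative assumptions. It only shows the general pattern of picking, one at a time, the vision token with the largest marginal gain in covering both the text tokens and the full set of vision tokens, which is the standard greedy approximation for monotone submodular maximum coverage.

```python
import numpy as np

def greedy_coverage_select(vision_tokens, text_tokens, k, alpha=0.5):
    """Greedily pick k vision tokens that (approximately) maximize coverage
    of the text tokens and of the original vision-token set.

    Coverage of a target set by a selected subset is measured here as the
    sum over targets of the maximum cosine similarity to any selected token
    (an assumed score, not necessarily the paper's)."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    V = normalize(vision_tokens)          # (N, d) vision token embeddings
    T = normalize(text_tokens)            # (M, d) text token embeddings

    sim_vv = V @ V.T                      # candidate-to-vision-target similarities
    sim_vt = V @ T.T                      # candidate-to-text-target similarities

    N = V.shape[0]
    selected = []
    cov_v = np.full(N, -1.0)              # best coverage of each vision target so far
    cov_t = np.full(T.shape[0], -1.0)     # best coverage of each text target so far

    for _ in range(min(k, N)):
        # marginal gain of each candidate: improvement in total coverage
        gain_v = np.maximum(sim_vv, cov_v).sum(axis=1) - cov_v.sum()
        gain_t = np.maximum(sim_vt, cov_t).sum(axis=1) - cov_t.sum()
        gains = alpha * gain_t + (1.0 - alpha) * gain_v
        gains[selected] = -np.inf         # never re-pick a selected token
        i = int(np.argmax(gains))
        selected.append(i)
        cov_v = np.maximum(cov_v, sim_vv[i])
        cov_t = np.maximum(cov_t, sim_vt[i])

    return selected

# Hypothetical usage with random embeddings standing in for real token features:
rng = np.random.default_rng(0)
vis = rng.normal(size=(576, 1024))        # e.g., a 24x24 grid of vision tokens
txt = rng.normal(size=(32, 1024))         # text tokens projected to the same space
keep_idx = greedy_coverage_select(vis, txt, k=64)
```

Because the coverage objective is monotone and submodular, this greedy loop carries the usual (1 - 1/e) approximation guarantee; the weight `alpha` trades off covering the text query against preserving the diversity of the original vision tokens.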