VisionZip: Longer is Better but Not Necessary in Vision Language Models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
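The abstract describes VisionZip only at a high level: a subset of informative visual tokens is selected before they are passed to the language model. The sketch below illustrates that general idea with a top-k selection based on per-token importance scores (e.g., attention weights from the vision encoder). It is a minimal illustration under assumed inputs, not the authors' exact algorithm; the function name, tensor shapes, and use of [CLS]-attention as the importance score are illustrative assumptions — see the official repository for the actual implementation.

```python
import torch

def select_informative_tokens(visual_tokens: torch.Tensor,
                              importance: torch.Tensor,
                              keep: int = 64) -> torch.Tensor:
    """Keep the `keep` visual tokens with the highest importance scores.

    visual_tokens: (batch, num_tokens, dim) patch features from the vision encoder.
    importance:    (batch, num_tokens) per-token scores, e.g. attention received
                   from the [CLS] token in the last encoder layer (an assumption).
    """
    # Rank tokens by score and keep the indices of the top `keep`.
    topk = importance.topk(keep, dim=1).indices            # (batch, keep)
    topk, _ = topk.sort(dim=1)                             # preserve spatial order
    # Gather the corresponding feature vectors.
    idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, idx)                    # (batch, keep, dim)

if __name__ == "__main__":
    feats = torch.randn(1, 576, 1024)   # e.g. 24x24 patches from a CLIP-L/14 encoder at 336px
    scores = torch.rand(1, 576)         # stand-in importance scores
    reduced = select_informative_tokens(feats, scores, keep=64)
    print(reduced.shape)                # torch.Size([1, 64, 1024])
```

Reducing 576 visual tokens to 64 before the language-model prefill is what yields the reported speedups, since prefill cost grows with sequence length.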

Code Repositories

dvlab-research/visionzip (official, PyTorch)

Benchmarks

| Benchmark | Methodology | GPT-4 score |
| --- | --- | --- |
| Visual Question Answering on MM-Vet | VisionZip (Retain 128 Tokens, fine-tuning) | 32.9 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 64 Tokens, fine-tuning) | 30.2 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 128 Tokens) | 32.6 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 192 Tokens, fine-tuning) | 32.6 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 192 Tokens) | 31.7 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 64 Tokens) | 31.7 |
