VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia

Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
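The selection idea in the abstract, keeping a small set of informative visual tokens and passing only those to the language model, can be illustrated with a minimal sketch. The abstract does not say how informativeness is measured, so the per-token `scores` below are a hypothetical input, and the function name `select_informative_tokens` is invented for illustration; this is not the authors' implementation.

```python
from typing import List, Sequence


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[Sequence[float]]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token importance value; how it is
    computed is an assumption and is not specified in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 <= keep <= len(tokens):
        raise ValueError("keep must be between 0 and the number of tokens")
    # Rank token indices by score, highest first (ties keep the earlier token).
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


# Toy example: four 2-dimensional "visual tokens" with made-up scores.
tokens = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.7, 0.8]]
scores = [0.05, 0.60, 0.10, 0.25]
selected = select_informative_tokens(tokens, scores, keep=2)
```

Here `selected` holds the second and fourth tokens, the two with the highest scores. The retained-token counts in the benchmark table (64, 128, 192) play the role of `keep` in this sketch.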
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| MM-Vet (visual question answering) | VisionZip (Retain 128 Tokens, fine-tuning) | GPT-4 score: 32.9 |
| MM-Vet (visual question answering) | VisionZip (Retain 64 Tokens, fine-tuning) | GPT-4 score: 30.2 |
| MM-Vet (visual question answering) | VisionZip (Retain 128 Tokens) | GPT-4 score: 32.6 |
| MM-Vet (visual question answering) | VisionZip (Retain 192 Tokens, fine-tuning) | GPT-4 score: 32.6 |
| MM-Vet (visual question answering) | VisionZip (Retain 192 Tokens) | GPT-4 score: 31.7 |
| MM-Vet (visual question answering) | VisionZip (Retain 64 Tokens) | GPT-4 score: 31.7 |