VisionZip: Longer is Better but Not Necessary in Vision Language Models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
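The abstract describes VisionZip only at a high level: a subset of informative visual tokens is selected before they are passed to the language model. The sketch below illustrates that general idea with a top-k selection based on per-token importance scores (e.g., attention weights from the vision encoder). It is a minimal illustration under assumed inputs, not the authors' exact algorithm; the function name, tensor shapes, and use of [CLS]-attention as the importance score are illustrative assumptions — see the official repository for the actual implementation.

```python
import torch

def select_informative_tokens(visual_tokens: torch.Tensor,
                              importance: torch.Tensor,
                              keep: int = 64) -> torch.Tensor:
    """Keep the `keep` visual tokens with the highest importance scores.

    visual_tokens: (batch, num_tokens, dim) patch features from the vision encoder.
    importance:    (batch, num_tokens) per-token scores, e.g. attention received
                   from the [CLS] token in the last encoder layer (an assumption).
    """
    # Rank tokens by score and keep the indices of the top `keep`.
    topk = importance.topk(keep, dim=1).indices            # (batch, keep)
    topk, _ = topk.sort(dim=1)                             # preserve spatial order
    # Gather the corresponding feature vectors.
    idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, idx)                    # (batch, keep, dim)

if __name__ == "__main__":
    feats = torch.randn(1, 576, 1024)   # e.g. 24x24 patches from a CLIP-L/14 encoder at 336px
    scores = torch.rand(1, 576)         # stand-in importance scores
    reduced = select_informative_tokens(feats, scores, keep=64)
    print(reduced.shape)                # torch.Size([1, 64, 1024])
```

Reducing 576 visual tokens to 64 before the language-model prefill is what yields the reported speedups, since prefill cost grows with sequence length.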

Code Repositories

dvlab-research/visionzip (official, PyTorch)

Benchmarks

| Benchmark | Methodology | GPT-4 score |
| --- | --- | --- |
| Visual Question Answering on MM-Vet | VisionZip (Retain 128 Tokens, fine-tuning) | 32.9 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 64 Tokens, fine-tuning) | 30.2 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 128 Tokens) | 32.6 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 192 Tokens, fine-tuning) | 32.6 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 192 Tokens) | 31.7 |
| Visual Question Answering on MM-Vet | VisionZip (Retain 64 Tokens) | 31.7 |
