LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, our LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
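
The first compression step mentioned in the abstract, dropping frames whose features are highly similar, can be illustrated with a minimal sketch. This is not the authors' implementation: the short hand-written vectors stand in for DINOv2 frame features, the function names and the 0.9 threshold are hypothetical, and comparing each frame to the most recently kept frame is an assumption made here for simplicity.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_non_redundant_frames(
    features: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[int]:
    """Return indices of frames to keep.

    A frame is kept only if its similarity to the most recently kept frame
    is below `threshold` (a hypothetical value, not taken from the paper).
    """
    kept: List[int] = []
    for i, feat in enumerate(features):
        if not kept or cosine_similarity(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


# Toy per-frame features standing in for DINOv2 embeddings (assumption).
frame_features = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly identical to frame 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # nearly identical to frame 2
    [0.0, 0.0, 1.0],
]
kept_indices = select_non_redundant_frames(frame_features, threshold=0.9)
print(kept_indices)  # [0, 2, 4]
```

In this toy input, frames 1 and 3 are near-duplicates of their predecessors and are dropped, so only three of the five frames would be passed on to the later query-guided and spatial reduction stages.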