
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng
Abstract

Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that this is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average over five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://github.com/magic-research/PLLaVA.
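
The abstract describes the pooling strategy only at a high level. The sketch below is a minimal, assumption-laden PyTorch illustration of averaging per-frame visual tokens along the temporal (and spatial) dimensions; the function name pool_video_features, the target_shape values, and the square-token-grid assumption are illustrative and not the paper's actual implementation.

    import torch
    import torch.nn as nn

    def pool_video_features(frame_feats, target_shape=(16, 12, 12)):
        # Hypothetical sketch: frame_feats is assumed to have shape
        # (B, T, H*W, D) -- per-frame visual tokens from an image encoder
        # whose spatial tokens form an H x W grid.
        B, T, N, D = frame_feats.shape
        H = W = int(N ** 0.5)                      # assume a square token grid
        x = frame_feats.view(B, T, H, W, D)        # (B, T, H, W, D)
        x = x.permute(0, 4, 1, 2, 3)               # (B, D, T, H, W) for pooling
        x = nn.AdaptiveAvgPool3d(target_shape)(x)  # average along time and space
        x = x.permute(0, 2, 3, 4, 1)               # (B, T', H', W', D)
        return x.flatten(1, 3)                     # (B, T'*H'*W', D) tokens for the LLM

    # Example: 16 frames, 24x24 = 576 spatial tokens per frame, 1024-dim features
    feats = torch.randn(2, 16, 576, 1024)
    pooled = pool_video_features(feats)
    print(pooled.shape)  # torch.Size([2, 2304, 1024])

Because adaptive average pooling mixes each output token from several input tokens, occasional extreme high-norm features are diluted rather than dominating the sequence passed to the language model, which is the smoothing effect the abstract refers to.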
