
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance than state-of-the-art Video LLMs that are fine-tuned on video datasets.
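To make the two-stream aggregation concrete, below is a minimal PyTorch sketch of how Slow and Fast pathways could be combined over pre-extracted frame features. The function name, tensor shapes, and default hyperparameters are illustrative assumptions based on the numbers quoted in the abstract (24x24 tokens per frame, 6x spatial downsampling), not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def slowfast_aggregate(frame_features, num_slow=10, pool_stride=6):
    """Aggregate per-frame visual tokens with a two-stream SlowFast design.

    frame_features: (T, H, W, C) tensor of visual-encoder tokens per frame,
        e.g. T=50 sampled frames with H=W=24 tokens each (hypothetical shapes).
    Returns a flat (N, C) token sequence to prepend to the LLM input.
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: few frames, full spatial resolution (all HxW tokens),
    # preserving detailed spatial semantics.
    slow_idx = torch.linspace(0, T - 1, num_slow).long()
    slow = frame_features[slow_idx]                  # (num_slow, H, W, C)
    slow_tokens = slow.reshape(-1, C)                # (num_slow*H*W, C)

    # Fast pathway: every frame, aggressively pooled in space so only
    # coarse motion cues remain (e.g. 24x24 -> 4x4 with stride 6).
    fast = frame_features.permute(0, 3, 1, 2)        # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=pool_stride, stride=pool_stride)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenate both streams into one token sequence; with the shapes
    # above this stays well under typical LLM context budgets.
    return torch.cat([slow_tokens, fast_tokens], dim=0)
```

With the assumed shapes, the Slow stream contributes 10 x 24 x 24 = 5,760 tokens and the Fast stream 50 x 4 x 4 = 800 tokens, illustrating how the design trades spatial detail against temporal coverage within a fixed token budget.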
