QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen

Published: 5/25/2025

Abstract

Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves a 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
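
The idea of splitting a video into keyframe-aligned intervals that are decoded concurrently can be illustrated with a minimal Python sketch. This is not the QuickDecoder implementation: the function names, the frame-index representation of keyframes, and the `decode_interval` stand-in (which only returns frame indices instead of decoding pixels) are all assumptions made for illustration.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def keyframe_aligned_intervals(keyframes, num_frames, num_workers):
    """Split [0, num_frames) into at most num_workers intervals,
    each starting on a keyframe (hypothetical helper)."""
    # Assumption: keyframes is a sorted list of frame indices starting at 0.
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected a sorted keyframe list starting at frame 0")
    target = num_frames / num_workers
    starts = [0]
    for w in range(1, num_workers):
        # Snap the ideal boundary back to the last keyframe at or before it,
        # so every worker can begin decoding without earlier frames.
        ideal = int(w * target)
        k = keyframes[bisect_right(keyframes, ideal) - 1]
        if k > starts[-1]:
            starts.append(k)
    ends = starts[1:] + [num_frames]
    return list(zip(starts, ends))


def decode_interval(interval):
    """Stand-in for a real decoder: returns the frame indices it would decode."""
    start, end = interval
    return list(range(start, end))


def parallel_decode(keyframes, num_frames, num_workers):
    """Decode all intervals concurrently and concatenate results in order."""
    intervals = keyframe_aligned_intervals(keyframes, num_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = list(pool.map(decode_interval, intervals))
    return [frame for chunk in chunks for frame in chunk]
```

Aligning interval starts to keyframes matters because non-key frames depend on earlier frames; a worker that starts mid-interval would first have to decode back from the preceding keyframe. A real decoder would replace `decode_interval` with calls into a video library, and might use processes rather than threads depending on whether the decoding library releases the interpreter lock.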
