
Time Blindness: Why Video-Language Models Can't See What Humans Can?

Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny
Published: 6/2/2025

Abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.
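
To make the core idea concrete, the sketch below shows one way a shape can be encoded purely in temporal structure: every frame looks like random binary noise on its own, but pixels inside the shape flicker coherently across frames while background pixels are resampled independently. This is only an illustrative assumption of how such stimuli might be generated, not the authors' SpookyBench generation code; the function name `make_temporal_noise_clip` and its parameters are hypothetical.

```python
import numpy as np

def make_temporal_noise_clip(mask, num_frames=60, seed=0):
    """Illustrative sketch (not the official SpookyBench generator).

    Encode a binary shape `mask` purely in temporal structure:
    each frame is binary noise, background pixels are resampled
    independently every frame, and shape pixels flip in unison,
    so no single frame reveals the shape but the frame-to-frame
    correlation does.
    """
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    # Frozen noise pattern used inside the shape region.
    shape_pattern = rng.integers(0, 2, size=(h, w))
    frames = []
    for t in range(num_frames):
        # Fresh, independent noise everywhere (the background).
        frame = rng.integers(0, 2, size=(h, w))
        # Shape pixels flip together on every frame (temporally coherent).
        coherent = shape_pattern ^ (t % 2)
        frame = np.where(mask, coherent, frame)
        frames.append(frame.astype(np.uint8) * 255)
    return np.stack(frames)

# Usage: a square that is invisible in any single frame but pops out
# from frame-to-frame coherence (e.g., via temporal differencing).
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
clip = make_temporal_noise_clip(mask)
print(clip.shape)  # (60, 64, 64)
```

Under this construction, a single frame carries no usable spatial signal, which is why per-frame spatial features alone cannot recover the pattern; only models (or observers) that integrate information across frames can.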