HyperAI超神经

Current datasets for long-form video understanding often fail to pose a genuine long-form challenge, since many tasks derived from them can be solved by analyzing just one or a few random frames of a video. To address this, the research team proposed CinePile, a novel dataset and benchmark designed for authentic long-form video understanding.

The research team built the dataset on human-generated raw data, using advanced LLMs with humans in the loop. It contains 305,000 multiple-choice questions (MCQs) covering various visual and multimodal aspects, including temporal understanding, human-object interactions, and reasoning about events or actions within a scene. The team also evaluated recent video-centric LLMs, both open-source and proprietary, on the test split of the dataset. The results show that even the most advanced video-centric LLMs perform significantly worse than humans on these tasks, highlighting the inherent complexity and difficulty of video understanding.
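
As a rough illustration of how multiple-choice results of this kind are commonly scored, the sketch below computes overall and per-category accuracy. The record keys (`category`, `answer`, `predicted`) and the sample records are hypothetical; they are not CinePile's actual schema or data, and this is not the authors' evaluation code.

```python
from collections import defaultdict


def mcq_accuracy(records):
    """Return (overall, per_category) accuracy for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      "category"  - question type, e.g. "temporal"
      "answer"    - index of the correct option
      "predicted" - index of the option chosen by the model (or a human)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        if r["predicted"] == r["answer"]:
            correct[r["category"]] += 1

    per_category = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Made-up example records, not taken from CinePile.
sample = [
    {"category": "temporal", "answer": 2, "predicted": 2},
    {"category": "temporal", "answer": 0, "predicted": 3},
    {"category": "human-object interaction", "answer": 1, "predicted": 1},
    {"category": "scene reasoning", "answer": 4, "predicted": 4},
]

overall, per_category = mcq_accuracy(sample)
print(f"overall accuracy: {overall:.2f}")
for category, acc in per_category.items():
    print(f"{category}: {acc:.2f}")
```

Reporting accuracy per question category, rather than a single number, makes it easier to see which aspects (for example, temporal understanding) account for the gap between models and humans.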

CinePile Long Video Understanding Question Answering Dataset