GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

Cognitive science has shown that humans perceive videos as a series of events separated by the state changes of dominant subjects. State changes trigger new events and are among the most useful pieces of the largely redundant information perceived. However, previous research has focused on the overall understanding of segments without evaluating the fine-grained status changes inside them. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170k boundaries associated with captions describing status changes in the generic events of 12K videos. On top of this new dataset, we propose three tasks that support the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines on our dataset, where we also design a new TPD (Temporal-based Pairwise Difference) modeling method for visual differences and achieve significant performance improvements. The results further show that current methods still face formidable challenges in utilizing different granularities, representing visual differences, and accurately localizing status changes. Further analysis shows that our dataset can drive the development of more powerful methods for understanding status changes and thus improve video-level comprehension. The dataset, including both videos and boundaries, is available at https://yuxuan-w.github.io/GEB-plus/
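The abstract names TPD (Temporal-based Pairwise Difference) modeling but does not describe its mechanism. The sketch below is a minimal, hypothetical illustration of what pairwise temporal difference features around an event boundary could look like, assuming per-frame visual features are available; the function name, window size, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def temporal_pairwise_difference(frame_feats: torch.Tensor,
                                 boundary_idx: int,
                                 window: int = 4) -> torch.Tensor:
    """Hypothetical sketch: contrast visual features before and after a boundary.

    frame_feats:  (T, D) tensor of per-frame visual features.
    boundary_idx: index of the candidate event-boundary frame.
    window:       number of frames sampled on each side of the boundary.
    Returns a (b, a, D) tensor of pairwise before/after feature differences,
    which a downstream model could pool or attend over to represent the change.
    """
    before = frame_feats[max(boundary_idx - window, 0):boundary_idx]  # (b, D)
    after = frame_feats[boundary_idx:boundary_idx + window]           # (a, D)
    # Broadcast to compute the difference between every (before, after) pair.
    diff = after.unsqueeze(0) - before.unsqueeze(1)                   # (b, a, D)
    return diff

# Usage example with random features standing in for a video encoder's output.
feats = torch.randn(100, 512)            # 100 frames, 512-dim features
diff = temporal_pairwise_difference(feats, boundary_idx=50)
print(diff.shape)                        # torch.Size([4, 4, 512])
```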