Video Based Generative Performance 4
"Video-based Generative Performance Benchmarking (Temporal Understanding)" is a benchmarking task designed to evaluate the temporal understanding capabilities of generative video dialogue models. This task constructs a test set based on the ActivityNet-200 dataset, which includes rich dense descriptive captions and human-annotated question-answer pairs. The evaluation pipeline developed using the GPT-3.5 model provides a relative score from 1 to 5 for the generated predictions, aiming to comprehensively measure the model's ability to understand and generate content along the video timeline, thereby enhancing the human-computer interaction experience.