DiDeMo Temporal Localization Dataset
License
Other

DiDeMo (Distinct Describable Moments) is a dataset for temporally localizing events in a video given a natural language description. The videos are collected from Flickr, and each one is trimmed to at most 30 seconds. To reduce the complexity of annotation, every video is divided into 5-second segments.
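The segmentation described above can be sketched in a few lines. This is a minimal illustration, not the dataset's official tooling: the function names are hypothetical, and it assumes that a moment is a contiguous run of one or more 5-second segments, which the text above does not state explicitly.

```python
SEGMENT_SECONDS = 5      # segment length stated in the description
MAX_VIDEO_SECONDS = 30   # maximum video length stated in the description


def segment_bounds(video_seconds=MAX_VIDEO_SECONDS, segment_seconds=SEGMENT_SECONDS):
    """Return (start, end) times in seconds for each fixed-length segment."""
    bounds = []
    start = 0
    while start < video_seconds:
        # The last segment is clipped if the video is not a multiple of 5 s.
        bounds.append((start, min(start + segment_seconds, video_seconds)))
        start += segment_seconds
    return bounds


def candidate_moments(num_segments):
    """Enumerate every contiguous run of segments as (first_index, last_index).

    Assumption: a moment spans one or more adjacent segments.
    """
    return [(i, j) for i in range(num_segments) for j in range(i, num_segments)]


def moment_to_seconds(moment, bounds):
    """Convert a (first_index, last_index) moment to (start, end) seconds."""
    first, last = moment
    return bounds[first][0], bounds[last][1]


bounds = segment_bounds()
moments = candidate_moments(len(bounds))
print(len(bounds))                        # 6 segments in a 30-second video
print(len(moments))                       # 21 candidate moments
print(moment_to_seconds((1, 2), bounds))  # (5, 15)
```

Under that assumption, a full 30-second video yields 6 segments and 21 candidate moments, so localization reduces to choosing among a small, fixed set of intervals rather than regressing arbitrary timestamps.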
The dataset is split into training, validation, and test sets containing 8,395, 1,065, and 1,004 videos respectively. It contains 26,892 moments in total, and a single moment may be associated with descriptions from multiple annotators. The descriptions are detailed and mention camera movements, temporal transition indicators, and activities. Each description is also verified, so that it refers to a single moment.
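As a quick sanity check on the figures above, the following sketch derives the total number of videos, the share of each split, and the average number of moments per video. The variable names are illustrative, and the only inputs are the counts quoted in the paragraph.

```python
# Video counts per split, as quoted in the description.
split_videos = {"train": 8395, "val": 1065, "test": 1004}
total_moments = 26892

total_videos = sum(split_videos.values())
split_share = {name: count / total_videos for name, count in split_videos.items()}
moments_per_video = total_moments / total_videos

print(total_videos)  # 10464
for name, share in split_share.items():
    print(f"{name}: {share:.1%}")  # train: 80.2%, val: 10.2%, test: 9.6%
print(f"{moments_per_video:.2f}")  # 2.57
```

The split is therefore roughly 80/10/10 by video count, with a little over two and a half moments per video on average.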