Vript English Video-text Dataset

Vript is a fine-grained video-text dataset of high-resolution videos, containing 12k annotated videos and more than 420k clips in total. Each clip is paired with a caption of roughly 145 words, far longer than the annotations in most video-text datasets, yielding denser and more detailed descriptions. The annotation style is inspired by video scripts: the documents written before filming to plan how each scene will be shot.
Unlike previous video-text datasets, Vript records not only the video content but also the shot type (e.g., medium shot, close-up) and camera movement (e.g., pan, tilt), enriching the captions. In addition, Vript transcribes the narration and provides it alongside the video title to give more context for the annotations.
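To make the annotation structure concrete, here is a minimal sketch of what one clip record might look like. All field names and values below are illustrative assumptions, not the dataset's actual schema; consult the official Vript release for the real format.

```python
import json

# Hypothetical example of a single Vript clip annotation.
# Field names are assumptions for illustration only.
clip_record = json.loads("""
{
  "video_id": "abc123",
  "clip_id": "abc123_clip_0001",
  "video_title": "Example cooking tutorial",
  "shot_type": "close-up",
  "camera_movement": "pan",
  "caption": "A close-up of hands chopping vegetables on a wooden board.",
  "narration": "Today we start by dicing the onions."
}
""")

def describe_clip(record: dict) -> str:
    """Prepend shot/camera metadata to the caption, mirroring how
    Vript enriches descriptions beyond plain video content."""
    return (f"[{record['shot_type']} | {record['camera_movement']}] "
            f"{record['caption']}")

print(describe_clip(clip_record))
```

A loader along these lines could combine the caption, shot metadata, narration, and title into a single training text for video-language models.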
The dataset was released in 2024 by teams from Shanghai Jiao Tong University, Beihang University, and Xiaohongshu. The accompanying paper is "Vript: A Video Is Worth Thousands of Words".