Vript English Video-Text Dataset

Vript is a fine-grained video-text dataset of high-resolution videos, containing 12k annotated videos and more than 420k clips in total. Each clip is accompanied by a caption of about 145 words, much longer than the annotations in most video-text datasets, which gives a more detailed and dense description. The annotation format is inspired by video scripts, the kind written before filming to plan how each scene will be shot.

Unlike previous video-text datasets, Vript records not only the video content but also the shot type (such as medium shot or close-up) and the camera movement (such as pan or tilt), which makes the captions richer. Vript also transcribes the narration into text and supplies it, together with the video title, as additional context for the annotations.
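
To make the annotation contents described above more concrete, here is a minimal Python sketch of how one clip's annotation might be represented. It is an illustration only: the class name `ClipAnnotation`, its field names, the helper `build_context`, and the sample values are all hypothetical and do not reflect the dataset's actual file format or schema.

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotation:
    """Hypothetical record for one annotated clip (field names are assumptions)."""
    video_title: str      # title of the source video
    narration: str        # narration transcribed into text
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "pan", "tilt"
    caption: str          # long, script-style description of the clip


def build_context(clip: ClipAnnotation) -> str:
    """Join the fields of a clip into one labeled text block, one field per line."""
    parts = [
        f"Title: {clip.video_title}",
        f"Narration: {clip.narration}",
        f"Shot type: {clip.shot_type}",
        f"Camera movement: {clip.camera_movement}",
        f"Caption: {clip.caption}",
    ]
    return "\n".join(parts)


# Made-up sample values, for illustration only.
example = ClipAnnotation(
    video_title="Sample cooking video",
    narration="First, we chop the onions.",
    shot_type="close-up",
    camera_movement="pan",
    caption="A pair of hands chops an onion on a wooden board.",
)
context = build_context(example)
```

Keeping the title and narration next to the caption mirrors the idea that they serve as extra context for each clip's description; how the real dataset stores or orders these fields is not specified here.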

The dataset was released in 2024 by Shanghai Jiao Tong University, Beihang University, and the Xiaohongshu team. The accompanying paper is "Vript: A Video Is Worth Thousands of Words".
