InternVid-Full High-quality Large-scale Video-text Dataset

This dataset is a high-quality, large-scale video-text dataset jointly released by the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab), Nanjing University, the Chinese Academy of Sciences and other institutions in 2024. It aims to meet the growing demand for video-language modeling and promote further improvement in large-model video understanding and generation capabilities.
As one of the largest public video-text datasets in the world, InternVid contains over 7 million videos with detailed text descriptions, covering 16 scenes and about 6,000 action descriptions, with a total duration of nearly 760,000 hours. The videos and their text descriptions are closely matched, so the dataset serves as a "video dictionary" for training multimodal learning tasks such as video-text semantic matching, video-text retrieval, and video-text generation.
InternVid has received widespread attention in the academic community, has been applied to the multimodal world model LWM, and has been used or referenced by Google and Stability AI in video generation work. The related paper was selected as a Spotlight at the 2024 International Conference on Learning Representations (ICLR 2024).