AutoCaption Video Caption Benchmark Dataset

The AutoCaption dataset is a video captioning benchmark dataset released by Tjunlp Lab in 2025. It accompanies the paper "Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search" and aims to advance research on multimodal large language models for video caption generation.

Dataset structure:

The dataset contains 2 subsets, with a total of 11,184 samples:

  • sft_data: supervised fine-tuning data for captioning models (9,419 samples)
  • mcts_vcb: MCTS-generated captions and keypoints used for evaluation on the MCTS-VCB benchmark (1,765 samples)
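
As a quick sanity check on the figures above, the following minimal Python sketch tallies the two subsets and reports each one's share of the total. The counts are taken from the description; the dictionary and the `summarize` helper are illustrative names only and are not part of the dataset or any official tooling.

```python
# Subset sizes as stated in the dataset description above.
SUBSET_SIZES = {
    "sft_data": 9_419,   # supervised fine-tuning data
    "mcts_vcb": 1_765,   # MCTS-VCB evaluation benchmark
}


def summarize(sizes):
    """Return the total sample count and each subset's share of it.

    Hypothetical helper, written only for this illustration.
    """
    total = sum(sizes.values())
    shares = {name: count / total for name, count in sizes.items()}
    return total, shares


total, shares = summarize(SUBSET_SIZES)
print(f"total samples: {total}")
for name, share in shares.items():
    print(f"{name}: {SUBSET_SIZES[name]} samples ({share:.1%})")
```

The two counts add up to the stated total of 11,184 samples, with the fine-tuning subset making up the large majority.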