VideoRewardBench Video Reward Model Evaluation Dataset
VideoRewardBench, jointly developed by the University of Science and Technology of China and Huawei Noah's Ark Lab, is the first comprehensive evaluation benchmark (released in 2025) to fully cover the four core dimensions of video understanding: perception, knowledge, reasoning, and safety. It is introduced in the paper "VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding". The benchmark aims to systematically evaluate a reward model's ability to make preference judgments and quality assessments over generated responses in complex video understanding scenarios.
The dataset contains 1,563 annotated samples, covering 1,482 distinct videos and 1,559 distinct questions. Each sample consists of a video-text prompt, a preferred response, and a rejected response.
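For illustration, a single sample might be represented as a record like the sketch below. The field names (`video`, `question`, `chosen`, `rejected`, `task`) are assumptions chosen for readability, not the dataset's documented schema:

```python
# Hypothetical shape of one VideoRewardBench sample; field names are
# illustrative assumptions, not the dataset's official schema.
sample = {
    "video": "videos/example.mp4",                      # source video clip
    "question": "What is the person assembling in the clip?",
    "chosen": "The person is assembling a wooden bookshelf, ...",   # preferred response
    "rejected": "The person appears to be repairing a bicycle, ...",  # rejected response
    "task": "short-form perception",                    # one of five task categories
}
```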
Dataset distribution:
By task dimension, the samples span five categories (perception is split into long-form and short-form), with a relatively balanced overall distribution:
- Long-form perception: 283 samples (18.1%)
- Short-form perception: 413 samples (26.4%)
- Knowledge: 238 samples (15.2%)
- Reasoning: 278 samples (17.8%)
- Safety: 351 samples (22.5%)
By video duration, the collection is predominantly short-form:
- ≤ 1 minute: 59.9%
- 1–5 minutes: 33.2%
- > 5 minutes: 6.9%
Text statistics:
- Average question length: 28.8 words
- Average response length: 103.8 words
- Average length of preferred/rejected responses: 102.9 / 104.6 words
The near-identical length distributions of preferred and rejected responses indicate that the preference labels are driven primarily by answer quality rather than by differences in text length.
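As a sanity check on such statistics, average word counts can be recomputed directly from the response pairs. The snippet below is a minimal sketch using toy stand-in data in place of the actual 1,563 pairs:

```python
def avg_words(texts):
    """Mean length in words across a list of responses."""
    return sum(len(t.split()) for t in texts) / len(texts)

# Toy stand-in data; in practice these would be the dataset's
# 1,563 preferred/rejected response pairs.
chosen = ["The clip shows a chef plating a layered dessert.",
          "Two players exchange short passes before the goal."]
rejected = ["A chef is cooking something in a kitchen.",
            "Some players are running on the field."]

print(f"preferred avg: {avg_words(chosen):.1f} words")
print(f"rejected avg:  {avg_words(rejected):.1f} words")
```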