Date

2 months ago

Organization

Paper URL

License

MIT

Dataset distribution:

By task dimension, the dataset covers five core evaluation dimensions, and the overall distribution is relatively balanced:

Long-form perception: 283 groups (18.1%)
Short-form perception: 413 groups (26.4%)
Knowledge: 238 groups (15.2%)
Reasoning: 278 groups (17.8%)
Safety: 351 groups (22.5%)
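
The percentages above follow directly from the listed counts. As a quick check, here is a minimal Python sketch that recomputes each share; the names `counts`, `total`, and `shares` are illustrative and not part of any dataset tooling.

```python
# Counts per evaluation dimension, copied from the list above.
counts = {
    "Long-form perception": 283,
    "Short-form perception": 413,
    "Knowledge": 238,
    "Reasoning": 278,
    "Safety": 351,
}

# Total number of groups across all five dimensions.
total = sum(counts.values())

# Share of each dimension as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} groups ({pct}%)")
print(f"Total: {total} groups")
```

The counts sum to 1,563, and the recomputed shares match the figures listed above.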

By video duration, the videos are predominantly short:

≤ 1 minute: 59.9%
1–5 minutes: 33.2%
> 5 minutes: 6.9%

Text statistics:

Average question length: 28.8 words
Average response length: 103.8 words
Average length of preferred/rejected responses: 102.9 / 104.6 words

The similar average lengths of preferred and rejected responses suggest that preference labels are driven primarily by answer quality rather than by differences in text length.
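
To put the length gap in perspective, the following short sketch computes the absolute and relative difference between the two reported averages. The variable names are illustrative; only the two averages come from the statistics above.

```python
# Average response lengths in words, taken from the text statistics above.
preferred_avg = 102.9
rejected_avg = 104.6

# Absolute gap in words and the gap relative to the preferred average.
gap_words = rejected_avg - preferred_avg
gap_pct = 100 * gap_words / preferred_avg

print(f"Gap: {gap_words:.1f} words ({gap_pct:.1f}% of the preferred length)")
```

The gap is under two words on responses of roughly a hundred words, which is consistent with the claim that length alone is unlikely to explain the preference labels.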


Paper URL

2509.00484

Related Datasets

HumanSense Benchmark Dataset

3 months ago

VenusBench-GD Cross-Platform Interface Understanding Dataset

a month ago

DetectiumFire Multimodal Fire Understanding Dataset

2 months ago

SimpleQA Concise Factual Question Answering Evaluation Dataset

a month ago

EditReward-Bench Image Editing Evaluation Dataset

3 months ago

5.08 GB · 61

VERA Voice Reasoning Evaluation Dataset

3 months ago

2.37 GB · 59

GroundingME Complex Scene Understanding Evaluation Dataset

a month ago

Spatial-SSRL-81k Spatial Awareness Self-Supervised Dataset

2 months ago

PhysToolBench Physics Tool Task Dataset

2 months ago

1.56 GB · 56

VideoRewardBench Video Reward Model Evaluation Dataset
