Verse-Bench Audio-Visual Joint Generation Evaluation Dataset

Date

19 days ago

Organization

StepFun
The Hong Kong University of Science and Technology (Guangzhou)

Paper URL

2509.06155

License

Apache 2.0

Verse-Bench is a benchmark dataset for evaluating joint audio-video generation, released in 2025 by StepFun in collaboration with the Hong Kong University of Science and Technology, the Hong Kong University of Science and Technology (Guangzhou), and other institutions. It accompanies the paper "UniVerse-1: Unified Audio-Video Generation via Stitching of Experts" and aims to push generative models beyond producing video alone, toward keeping the video in strict temporal alignment with its audio content (including ambient sound and speech).

The dataset contains 600 image-text prompt pairs. The images are sourced from frames of YouTube, Bilibili, and TikTok videos, movie and anime screenshots, AI-generated images, and public web images.

Data distribution

The dataset is divided into three subsets (Set1-I, Set2-V, and Set3-Ted). Together they cover a variety of audio categories, including human voices, animal sounds, instrumental music, natural sounds, human-object interaction sounds, object impacts, and mechanical noises, and they suit different scenarios and content types. The distribution is as follows:

  • Set1-I contains 205 image-text pairs whose images come from AI generation, web scraping, and media screenshots. Each image serves as the visual input, and the corresponding video/audio captions and speech content are produced by a large language model (LLM) together with human annotation.
  • Set2-V contains 295 short video clips from YouTube and Bilibili. Each clip comes with LLM-generated captions and a speech transcript produced with Whisper automatic speech recognition (ASR), both manually verified.
  • Set3-Ted contains 100 samples drawn from TED talk videos from September 2025, annotated with the same process as Set2-V.
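
As a quick sanity check on the distribution above, the three subset sizes can be added up and expressed as shares of the full benchmark. This is only an illustrative sketch: the dictionary and the helper name `subset_shares` are made up for this example and are not part of any official Verse-Bench tooling. The only data used are the counts stated in the list.

```python
# Subset sizes as stated in the distribution list (not read from the dataset itself).
SUBSET_SIZES = {"Set1-I": 205, "Set2-V": 295, "Set3-Ted": 100}


def subset_shares(sizes):
    """Return each subset's fraction of the total sample count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SUBSET_SIZES.values())
shares = subset_shares(SUBSET_SIZES)

for name, share in shares.items():
    print(f"{name}: {SUBSET_SIZES[name]} samples ({share:.1%})")
print(f"Total: {total} samples")
```

The total matches the 600 prompt pairs quoted for the dataset as a whole, with Set2-V making up roughly half of the benchmark.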