Ref-AVS Audio-Visual Scene Segmentation Dataset
Date: 2024
Size
Publish URL
The Ref-AVS dataset was released in 2024 by researchers from Renmin University of China, Beijing University of Posts and Telecommunications, and Shanghai Artificial Intelligence Laboratory. The accompanying paper, "Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes", was accepted at ECCV 2024.
The Ref-AVS dataset is a benchmark for referring object segmentation in audio-visual scenes: given a video, its audio track, and a referring expression, a model must segment the referred object at the pixel level. It provides pixel-level annotations and aims to facilitate the development of multimodal machine learning models, especially for complex tasks that require fusing audio and visual information.
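For concreteness, here is a minimal sketch of how one such sample might be represented when building a data loader, together with the mask IoU measure commonly used to score pixel-level segmentation. The field names (`frames`, `audio`, `expression`, `masks`) and shapes are illustrative assumptions, not the dataset's official schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RefAVSSample:
    """Hypothetical container for one Ref-AVS example (field names assumed)."""
    frames: np.ndarray   # (T, H, W, 3) uint8 video frames
    audio: np.ndarray    # (num_samples,) float32 mono waveform
    expression: str      # natural-language reference, e.g. "the instrument being played"
    masks: np.ndarray    # (T, H, W) bool pixel-level ground-truth masks

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index between predicted and ground-truth binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0
```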
The research team selected audible objects covering 48 categories: 20 musical instruments, 8 animals, 15 machines, and 5 human categories. Annotations were collected with the team's custom-built GSAI labeling system.
During video collection, the research team adopted the techniques introduced in [3, 47] to ensure that the audio and video clips align with the intended semantics. All videos are sourced from YouTube under Creative Commons licenses, and each video is trimmed to 10 seconds. Throughout the manual collection process, the team deliberately excluded three kinds of videos: 1) videos containing many objects with identical semantics; 2) videos with heavy editing and frequent camera switching; 3) non-realistic videos containing synthetic artifacts.
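As a rough illustration of the 10-second trimming step, here is a hedged sketch using the ffmpeg command-line tool; the start offset, file names, and the stream-copy choice are assumptions for the example, not the authors' documented pipeline.

```python
import subprocess

def trim_clip(src: str, dst: str, start: float = 0.0, duration: float = 10.0) -> None:
    """Cut a fixed-length clip from a source video with ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start),    # seek to the desired start time (keyframe-based here)
            "-t", str(duration),  # keep `duration` seconds of material
            "-i", src,
            "-c", "copy",         # copy streams without re-encoding
            dst,
        ],
        check=True,
    )

# Example usage (hypothetical file names):
# trim_clip("raw_video.mp4", "clip_10s.mp4", start=5.0)
```

Placing `-ss` before `-i` with stream copy makes the cut fast but snaps to keyframes; re-encoding instead of `-c copy` would give frame-accurate boundaries at the cost of speed.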