AVA Action Recognition Dataset
Date
3 years ago
Size
52.82 MB
License
CC BY 4.0
AVA, short for Atomic Visual Actions, is a video dataset with audio-visual annotations designed to advance machine understanding of human activity. Each video clip is annotated in detail by human annotators, and the annotations capture diverse scenes, recording conditions, and expressions of human activity.
The dataset annotations include:
- Kinetics (AVA-Kinetics): a crossover of the AVA and Kinetics datasets. To provide localized action labels across a wider range of visual scenes, the authors added AVA-style action labels to Kinetics-700 videos, nearly doubling the total number of annotations; for some action classes, the number of annotated videos grows by more than 500 times.
- Actions (AVA-Actions): densely annotates 80 atomic visual actions in 430 fifteen-minute movie clips. Actions are localized in both space and time, yielding 1.62 million action labels, with multiple labels per person occurring frequently.
- Spoken Activity (AVA-ActiveSpeaker, AVA-Speech): AVA-ActiveSpeaker associates speech with visible faces in the AVA v1.0 videos, annotating about 3.65 million frames across roughly 39,000 faces. AVA-Speech densely annotates speech activity in the AVA v1.0 videos and explicitly labels 3 background-noise conditions, producing ~4,600 annotated clips spanning 45 hours.
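The spatio-temporal action labels described above are commonly distributed as plain CSV rows of the form `video_id,timestamp,x1,y1,x2,y2,action_id,person_id`, with box coordinates normalized to [0, 1]. A minimal parsing sketch, assuming that column order (the sample row below uses illustrative values, not real dataset entries):

```python
import csv
from dataclasses import dataclass
from io import StringIO

@dataclass
class AvaAnnotation:
    """One spatio-temporal action label: a person box at a keyframe plus an action class."""
    video_id: str
    timestamp: float                          # keyframe time, in seconds
    box: tuple                                # (x1, y1, x2, y2), normalized to [0, 1]
    action_id: int                            # index into the 80 atomic action classes
    person_id: int                            # track id linking boxes of the same person

def parse_ava_csv(fileobj):
    """Parse AVA-style CSV rows into AvaAnnotation records."""
    annotations = []
    for row in csv.reader(fileobj):
        vid, ts, x1, y1, x2, y2, action, person = row
        annotations.append(AvaAnnotation(
            video_id=vid,
            timestamp=float(ts),
            box=(float(x1), float(y1), float(x2), float(y2)),
            action_id=int(action),
            person_id=int(person),
        ))
    return annotations

# Illustrative row: one person box at t=902s labeled with action class 80.
sample = StringIO("-5KQ66BBWC4,902,0.077,0.151,0.283,0.811,80,1\n")
anns = parse_ava_csv(sample)
print(anns[0].action_id, anns[0].box)
```

Because several action labels can apply to the same person at the same keyframe, the same `(video_id, timestamp, person_id)` triple may appear on multiple rows, each with a different `action_id`.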
AVA.torrent
Seeding: 2 | Downloading: 1 | Completed: 496 | Total Downloads: 525