BABEL: Bodies, Action and Behavior with English Labels

Understanding the semantics of human movement -- the what, how, and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels describe the overall action in the sequence, and frame labels describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks such as action recognition, temporal action localization, and motion synthesis. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We show that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code are made available, and supported for academic research purposes, at https://babel.is.tue.mpg.de/.
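The two-level annotation scheme described above can be sketched as a small data structure. This is a minimal illustration under assumptions: the class and field names (`FrameLabel`, `BabelSequence`, `actions_at`) are hypothetical and do not reflect the dataset's actual schema, only the idea of per-sequence labels plus time-aligned, possibly overlapping per-frame labels.

```python
# Illustrative sketch of two-level action annotations (sequence + frame labels).
# All names here are assumptions for exposition, not BABEL's real schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameLabel:
    action: str      # action category, e.g. "walk"
    start_t: float   # start time (seconds), aligned to the mocap sequence
    end_t: float     # end time (seconds)


@dataclass
class BabelSequence:
    seq_id: str
    sequence_labels: List[str]                      # overall action(s) for the sequence
    frame_labels: List[FrameLabel] = field(default_factory=list)  # may overlap in time

    def actions_at(self, t: float) -> List[str]:
        """Return all actions active at time t; overlaps yield multiple actions."""
        return [f.action for f in self.frame_labels if f.start_t <= t < f.end_t]


seq = BabelSequence(
    seq_id="0001",
    sequence_labels=["walk and wave"],
    frame_labels=[
        FrameLabel("walk", 0.0, 4.0),
        FrameLabel("wave", 2.0, 3.5),  # overlaps with "walk"
    ],
)
print(seq.actions_at(2.5))  # both actions are active at t=2.5
```

A structure like this makes the benchmark tasks concrete: sequence labels support whole-sequence action recognition, while the time-aligned frame labels support temporal action localization.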