Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living

Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, the 2D Skeleton Induction Module and the 3D Skeleton Induction Module, which are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
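
To illustrate the plug-in design the abstract describes, here is a minimal PyTorch sketch of how auxiliary pose-induction modules can be attached to an RGB backbone during training and skipped at inference. The names (`SkeletonInductionHead`, `PoseInducedViT`), the MSE regression objectives, and the backbone interface are all assumptions for illustration; the paper's actual module architectures and auxiliary tasks may differ.

```python
# A minimal sketch of training-only auxiliary modules on an RGB video
# backbone. All module names, dimensions, and losses are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonInductionHead(nn.Module):
    """Hypothetical auxiliary head: maps RGB features to a pose-space target."""
    def __init__(self, feat_dim: int, pose_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, pose_dim),
        )

    def forward(self, rgb_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(rgb_feats)

class PoseInducedViT(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 pose2d_dim: int, pose3d_dim: int):
        super().__init__()
        self.backbone = backbone              # any video transformer encoder
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Plug-in modules used only while training.
        self.sim_2d = SkeletonInductionHead(feat_dim, pose2d_dim)
        self.sim_3d = SkeletonInductionHead(feat_dim, pose3d_dim)

    def forward(self, video, pose2d_target=None, pose3d_target=None):
        feats = self.backbone(video)          # (B, feat_dim) pooled features
        logits = self.classifier(feats)
        if self.training:
            # Pose-aware auxiliary losses push pose cues into RGB features.
            aux = (F.mse_loss(self.sim_2d(feats), pose2d_target)
                   + F.mse_loss(self.sim_3d(feats), pose3d_target))
            return logits, aux
        # At inference the auxiliary heads are simply not executed, so no
        # poses or extra compute are needed, matching the claim above.
        return logits
```

Because the pose branches contribute only auxiliary training losses, the deployed model is the unchanged RGB backbone plus classifier; this is what allows the modules to be discarded at inference without affecting the prediction path.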