Guided Attention for Interpretable Motion Captioning

Diverse and extensive work has recently been conducted on text-conditioned human motion generation. However, progress in the reverse direction, motion captioning, has not seen comparable advancement. In this paper, we introduce a novel architecture design that enhances text generation quality by emphasizing interpretability through spatio-temporal and adaptive attention mechanisms. To encourage human-like reasoning, we propose methods for guiding attention during training, emphasizing relevant skeleton areas over time and distinguishing motion-related words. We discuss and quantify our model's interpretability using relevant histograms and density distributions. Furthermore, we leverage interpretability to derive fine-grained information about human motion, including action localization, body part identification, and the distinction of motion-related words. Finally, we discuss the transferability of our approaches to other tasks. Our experiments demonstrate that attention guidance leads to interpretable captioning while enhancing performance compared to higher-parameter-count, non-interpretable state-of-the-art systems. The code is available at: https://github.com/rd20karim/M2T-Interpretable.
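
For illustration only, the sketch below shows one common way attention guidance of this kind could be realized as an auxiliary training loss that pushes spatial attention toward annotated body-part regions. The function name, tensor shapes, and weighting are hypothetical assumptions and do not reflect the released implementation.

```python
import torch

def attention_guidance_loss(attn_weights, target_mask, eps=1e-8):
    """KL-style penalty pushing attention toward annotated body-part regions.

    attn_weights: (batch, time, joints) softmax-normalized spatial attention.
    target_mask:  (batch, time, joints) binary mask of relevant joints.
    """
    # Normalize the binary mask into a reference distribution over joints.
    target_dist = target_mask / (target_mask.sum(dim=-1, keepdim=True) + eps)
    # KL(target || attention): small when attention mass covers the masked joints.
    kl = (target_dist * (torch.log(target_dist + eps)
                         - torch.log(attn_weights + eps))).sum(dim=-1)
    return kl.mean()

# Hypothetical usage: mix the guidance term with the captioning loss.
# caption_loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), token_ids)
# loss = caption_loss + lambda_guidance * attention_guidance_loss(attn, mask)
```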