HalluciNet-ing Spatiotemporal Representations Using a 2D-CNN

Spatiotemporal representations learned using 3D convolutional neural networks (CNNs) are currently used in state-of-the-art approaches for action-related tasks. However, 3D-CNNs are notoriously memory- and compute-intensive compared with simpler 2D-CNN architectures. We propose to hallucinate spatiotemporal representations from a 3D-CNN teacher with a 2D-CNN student. Requiring the 2D-CNN to predict the future and intuit upcoming activity encourages it to gain a deeper understanding of actions and how they evolve. The hallucination task is treated as an auxiliary task, which can be used with any other action-related task in a multitask learning setting. Thorough experimental evaluation shows that the hallucination task indeed helps improve performance on action recognition, action quality assessment, and dynamic scene recognition tasks. From a practical standpoint, being able to hallucinate spatiotemporal representations without an actual 3D-CNN can enable deployment in resource-constrained scenarios, such as with limited computing power and/or lower bandwidth. The codebase is available at https://github.com/ParitoshParmar/HalluciNet.
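
As a rough illustration of how a hallucination task can act as an auxiliary objective in a multitask setting, the sketch below combines a main-task loss with a feature-matching term between student and teacher representations. It is not taken from the released codebase: the choice of mean squared error for the hallucination term, softmax cross-entropy for the main task, and the names `multitask_loss` and `hallucination_weight` are all assumptions made for illustration, and plain Python lists stand in for network outputs.

```python
import math


def mse(student_feat, teacher_feat):
    """Mean squared error between two equal-length feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    squared = ((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))
    return sum(squared) / len(student_feat)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(logits, label, student_feat, teacher_feat,
                   hallucination_weight=1.0):
    """Main-task loss plus a weighted hallucination (feature-matching) loss.

    student_feat: features predicted by the 2D-CNN student (assumed input).
    teacher_feat: features from the 3D-CNN teacher, treated as fixed targets.
    hallucination_weight: hypothetical trade-off factor, not from the paper.
    """
    main = cross_entropy(logits, label)      # assumed main task: classification
    aux = mse(student_feat, teacher_feat)    # assumed hallucination loss: MSE
    return main + hallucination_weight * aux


if __name__ == "__main__":
    # Toy values only; real features would come from the two networks.
    loss = multitask_loss(
        logits=[2.0, 0.5, -1.0],
        label=0,
        student_feat=[0.1, 0.4, 0.3],
        teacher_feat=[0.0, 0.5, 0.3],
        hallucination_weight=0.5,
    )
    print(f"combined loss: {loss:.4f}")
```

Under this formulation the teacher is needed only to produce training targets; at inference time only the 2D-CNN student runs, which is what makes deployment in resource-constrained scenarios plausible.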