Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition
Indoor action recognition plays an important role in modern society, such as intelligent healthcare in large mobile cabin hospitals. With the wide usage of depth sensors like Kinect, multimodal information including skeleton and RGB modalities offers a promising way to improve recognition performance. However, existing methods either focus on a single data modality or fail to take full advantage of multiple data modalities. In this paper, we propose a Teacher-Student Multimodal Fusion (TSMF) model that fuses the skeleton and RGB modalities at the model level for indoor action recognition. In our TSMF, we utilize a teacher network to transfer the structural knowledge of the skeleton modality to a student network for the RGB modality. Extensive experiments on two benchmark datasets, NTU RGB+D and PKU-MMD, show that the proposed TSMF consistently outperforms state-of-the-art single-modal and multimodal methods. The results also indicate that our TSMF not only improves the accuracy of the student network but also significantly improves the ensemble accuracy.
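To make the teacher-student idea concrete, the sketch below illustrates one common way such cross-modal knowledge transfer can be realized: a frozen skeleton teacher supervises an RGB student through a softened-logit distillation loss, and the two modalities are ensembled at inference by averaging softmax scores. This is a minimal PyTorch illustration under our own assumptions; the backbone classes (SkeletonTeacher, RGBStudent), the KL-based distillation loss, and all hyperparameters (alpha, tau) are hypothetical stand-ins, not the exact architecture or loss of the TSMF paper, which the abstract does not specify.

```python
# Hypothetical sketch of teacher-student cross-modal transfer; the paper's
# exact backbones and loss are not given in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 60  # e.g., NTU RGB+D defines 60 action classes


class SkeletonTeacher(nn.Module):
    """Placeholder skeleton backbone: flattens joint coordinates and classifies."""
    def __init__(self, in_dim=25 * 3, num_classes=NUM_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.net(x)


class RGBStudent(nn.Module):
    """Placeholder RGB head: classifies pre-extracted frame features."""
    def __init__(self, feat_dim=512, num_classes=NUM_CLASSES):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        return self.head(feats)


def distill_step(teacher, student, skel, rgb_feats, labels, alpha=0.5, tau=2.0):
    """One training step: supervised cross-entropy on the RGB student plus
    KL distillation from the frozen skeleton teacher's softened logits."""
    with torch.no_grad():
        t_logits = teacher(skel)          # teacher is not updated
    s_logits = student(rgb_feats)
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau                          # standard temperature scaling
    return (1 - alpha) * ce + alpha * kd


def ensemble_predict(teacher, student, skel, rgb_feats):
    """Late fusion at inference: average the two modalities' softmax scores."""
    with torch.no_grad():
        probs = (F.softmax(teacher(skel), dim=1)
                 + F.softmax(student(rgb_feats), dim=1)) / 2
    return probs.argmax(dim=1)
```

In a setup like this, the distillation term is what lets the student benefit from the skeleton modality during training, while the ensemble step accounts for the additional accuracy gain the abstract reports when both networks are combined.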