Explore Human Parsing Modality for Action Recognition

Multimodal-based action recognition methods have achieved high success using pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both skeleton and human parsing modalities for action recognition. The first branch, for human pose, feeds robust skeletons into a graph convolutional network to model pose features, while the second branch, for human parsing, leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: https://github.com/liujf69/EPP-Net-Action.
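The dual-branch, late-fusion idea can be illustrated with a minimal PyTorch sketch: one branch produces class logits from skeleton sequences, the other from human parsing feature maps, and the two predictions are combined by a weighted sum. The placeholder backbones, tensor shapes, class count, and fusion weight below are illustrative assumptions only, not the EPP-Net implementation; see the repository linked above for the authors' code.

```python
import torch
import torch.nn as nn

class PoseBranch(nn.Module):
    """Stand-in for a graph convolutional backbone over skeleton sequences."""
    def __init__(self, num_classes: int):
        super().__init__()
        # A real GCN (e.g., a spatio-temporal graph model) would go here;
        # a flatten + linear layer stands in for illustration.
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_classes))

    def forward(self, skeletons: torch.Tensor) -> torch.Tensor:
        return self.encoder(skeletons)  # class logits from the pose branch

class ParsingBranch(nn.Module):
    """Stand-in for a convolutional backbone over human parsing feature maps."""
    def __init__(self, num_classes: int, in_channels: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, parsing_maps: torch.Tensor) -> torch.Tensor:
        return self.encoder(parsing_maps)  # class logits from the parsing branch

class DualBranchLateFusion(nn.Module):
    """Hypothetical dual-branch model fusing the two branches' predictions."""
    def __init__(self, num_classes: int, fusion_weight: float = 0.5):
        super().__init__()
        self.pose_branch = PoseBranch(num_classes)
        self.parsing_branch = ParsingBranch(num_classes)
        self.fusion_weight = fusion_weight  # assumed weighting, not taken from the paper

    def forward(self, skeletons: torch.Tensor, parsing_maps: torch.Tensor) -> torch.Tensor:
        pose_logits = self.pose_branch(skeletons)
        parsing_logits = self.parsing_branch(parsing_maps)
        # Late fusion: weighted sum of the two high-level predictions.
        return self.fusion_weight * pose_logits + (1.0 - self.fusion_weight) * parsing_logits

if __name__ == "__main__":
    model = DualBranchLateFusion(num_classes=60)   # e.g., 60 action classes as in NTU RGB+D
    skeletons = torch.randn(2, 3, 64, 25)          # (batch, coords, frames, joints)
    parsing_maps = torch.randn(2, 3, 112, 112)     # (batch, channels, height, width)
    logits = model(skeletons, parsing_maps)
    print(logits.shape)                            # torch.Size([2, 60])
```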