HTNet for micro-expression recognition

Facial expression is related to facial muscle contractions, and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which negatively impacts the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks, which can result in sub-optimal performance on micro-expression recognition tasks. Learning to recognize facial muscle movements is therefore a key challenge in micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages local temporal features and an aggregation layer that extracts local and global semantic facial features. Specifically, HTNet divides the face into four facial areas: the left lip area, left eye area, right eye area and right lip area. The transformer layer focuses on representing subtle local muscle movement with local self-attention in each area, while the aggregation layer learns the interactions between the eye areas and the lip areas. Experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The code and models are available at: \url{https://github.com/wangzhifengharrison/HTNet}
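The two-stage design described above — local self-attention within each of the four facial regions, followed by an aggregation step that models interactions across regions — can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it is a simplified single-head NumPy illustration, and the pooling choices and dimensions are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention; x: (tokens, dim)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)           # (tokens, tokens) affinities
    return softmax(scores) @ x              # attended tokens, (tokens, dim)

def htnet_sketch(regions):
    """regions: list of four (tokens, dim) patch-embedding arrays for the
    left lip, left eye, right eye and right lip areas (hypothetical shapes)."""
    # Transformer layer: local self-attention within each facial area.
    local = [self_attention(r) for r in regions]
    # Summarise each area by mean-pooling its attended tokens.
    summaries = np.stack([r.mean(axis=0) for r in local])   # (4, dim)
    # Aggregation layer: attention across the four area summaries models
    # eye-lip interactions; pool to a single global facial feature.
    global_feat = self_attention(summaries).mean(axis=0)    # (dim,)
    return global_feat

# Toy usage with random patch embeddings (16 tokens of dimension 32 per area).
rng = np.random.default_rng(0)
regions = [rng.standard_normal((16, 32)) for _ in range(4)]
feat = htnet_sketch(regions)
print(feat.shape)  # (32,)
```

The key point the sketch captures is the hierarchy: attention is first restricted to tokens inside one facial area (cheap, and focused on subtle local muscle movement), and only the pooled per-area summaries attend to one another at the upper level.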