Multi-scale Context-aware Network with Transformer for Gait Recognition

Although gait recognition has drawn increasing research attention recently, silhouette differences between subjects are quite subtle in the spatial domain, so temporal feature representation is crucial for gait recognition. Inspired by the observation that humans can distinguish the gaits of different subjects by adaptively focusing on clips of varying time scales, we propose a multi-scale context-aware network with transformer (MCAT) for gait recognition. MCAT generates temporal features across three scales and adaptively aggregates them using contextual information from both local and global perspectives. Specifically, MCAT contains an adaptive temporal aggregation (ATA) module that performs local relation modeling followed by global relation modeling to fuse the multi-scale features. Moreover, to remedy the spatial feature corruption caused by temporal operations, MCAT incorporates a salient spatial feature learning (SSFL) module that selects groups of discriminative spatial features. Extensive experiments on three datasets demonstrate state-of-the-art performance. Concretely, we achieve rank-1 accuracies of 98.7%, 96.2% and 88.7% under the normal walking, bag-carrying and coat-wearing conditions on CASIA-B, 97.5% on OU-MVLP and 50.6% on GREW. The source code will be available at https://github.com/zhuduowang/MCAT.git.
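To make the core idea concrete, the following is a minimal, illustrative sketch of multi-scale temporal feature fusion with adaptive weights. It is not the paper's implementation: the pooling windows (1/2/4), the use of temporal mean pooling, and the softmax-weighted fusion with externally supplied scores (standing in for the context-derived weights that ATA would compute) are all assumptions made purely for illustration.

```python
import math

def avg_pool(seq, window):
    """Average-pool a sequence of frame feature vectors over
    non-overlapping temporal windows of the given size."""
    pooled = []
    for i in range(0, len(seq) - window + 1, window):
        chunk = seq[i:i + window]
        pooled.append([sum(vals) / window for vals in zip(*chunk)])
    return pooled

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_scale_aggregate(frames, scale_scores):
    """Build one temporal descriptor per scale (windows 1, 2, 4),
    then fuse the three descriptors with softmax weights.
    `scale_scores` is a hypothetical stand-in for the contextual
    relevance scores an ATA-like module would predict."""
    scale_feats = []
    for window in (1, 2, 4):
        pooled = avg_pool(frames, window)
        # temporal mean over the pooled clip -> one descriptor per scale
        mean = [sum(vals) / len(pooled) for vals in zip(*pooled)]
        scale_feats.append(mean)
    weights = softmax(scale_scores)
    dim = len(frames[0])
    return [sum(weights[s] * scale_feats[s][d] for s in range(3))
            for d in range(dim)]
```

With equal scores the three scales contribute uniformly; in the actual model the weights would instead be driven by local and global relation modeling over the sequence.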