Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

Li, Wenhao; Liu, Hong; Ding, Runwei; Liu, Mengyuan; Wang, Pichao; Yang, Wenming
Abstract

Despite the great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of a redundant 2D pose sequence to learn representative representations for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, the fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and aggregate information from local contexts. The modified VTE is termed the Strided Transformer Encoder (STE), which is built upon the outputs of VTE. STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global-and-local fashion, but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed, applied at the full-sequence and single-target-frame scales to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision and hence helps produce smoother and more accurate 3D poses. The proposed Strided Transformer is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with fewer parameters. Code and models are available at \url{https://github.com/Vegetebird/StridedTransformer-Pose3D}.
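The key mechanism in STE is replacing the feed-forward layers with strided convolutions so that each encoder block shortens the temporal dimension. A minimal sketch of the length reduction, using the standard 1D-convolution output-size formula (the specific kernel size, stride, and padding here are illustrative assumptions, not the paper's exact hyperparameters):

```python
def strided_conv1d_out_len(t, kernel=3, stride=3, pad=1):
    """Output sequence length of a 1D convolution over t frames.

    Standard formula: floor((t + 2*pad - kernel) / stride) + 1.
    """
    return (t + 2 * pad - kernel) // stride + 1


# Stacking a few strided blocks progressively shrinks a long 2D pose
# sequence toward a single-frame representation for the target 3D pose.
lengths = [27]  # e.g., a 27-frame input window (illustrative)
for _ in range(3):
    lengths.append(strided_conv1d_out_len(lengths[-1]))
print(lengths)  # [27, 9, 3, 1]
```

With stride 3 at each block, three STE blocks collapse 27 frames to one vector, which is why the strided design cuts computation compared with keeping the full sequence length through every layer.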
