Abstract

Recently, transformer-based methods have achieved significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers, and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) the length of the input joint sequence; (b) the quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the number of frames is increased to obtain higher estimation accuracy, and they are not robust to the noise naturally introduced by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimal modifications to PoseFormer, the proposed method effectively fuses features in both the time domain and the frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2.
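
To make the idea of a compact frequency-domain representation concrete, the following is a minimal sketch and not the authors' implementation (which is in the repository linked above). The abstract does not name the transform, so the sketch assumes a discrete cosine transform (DCT) followed by truncation to the lowest-frequency coefficients. The function names `dct_ii`, `idct_ii`, and `compress_trajectory`, the 81-frame window, the choice of 9 retained coefficients, and the synthetic trajectory are all hypothetical choices made for illustration.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1D sequence (assumed transform, see lead-in)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (t + 0.5) * k / n)
            for t, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def idct_ii(coeffs, n):
    """Inverse of dct_ii for a length-n signal.

    Coefficients beyond len(coeffs) are treated as zero, so passing a
    truncated list acts as a low-pass reconstruction.
    """
    signal = []
    for t in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * c * math.cos(math.pi * (t + 0.5) * k / n)
        signal.append(value)
    return signal


def compress_trajectory(trajectory, keep):
    """Keep only the `keep` lowest-frequency DCT coefficients."""
    return dct_ii(trajectory)[:keep]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Synthetic example: one coordinate of one 2D joint over 81 frames.
num_frames = 81
clean = [
    0.3 * math.cos(math.pi * (t + 0.5) / num_frames)
    + 0.1 * math.cos(3 * math.pi * (t + 0.5) / num_frames)
    for t in range(num_frames)
]
# Frame-to-frame jitter standing in for 2D detector noise.
noisy = [x + 0.05 * (-1) ** t for t, x in enumerate(clean)]

compact = compress_trajectory(noisy, keep=9)   # 9 numbers instead of 81
restored = idct_ii(compact, num_frames)        # low-pass reconstruction

print(len(compact), rms_error(noisy, clean), rms_error(restored, clean))
```

In this toy setting the 9 retained coefficients stand in for the whole 81-frame trajectory, and the reconstruction is closer to the clean signal than the noisy input is, because the frame-to-frame jitter falls mostly in the discarded high-frequency coefficients. That mirrors the two benefits the abstract claims, a shorter input for a long sequence and robustness to detection noise, under the assumptions stated above.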
