Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics are responsible for different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and use monotonous modeling structures (e.g., RNN or attention-based blocks) to design their networks. However, a single kind of modeling structure makes it difficult to balance the learning of short-term and long-term temporal correlations and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy stimulates the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer is responsible for exploiting local details on the human mesh and interacting with the global transformer via cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is further introduced to refine intra-frame estimations using the decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Code is available at https://github.com/sxl142/GLoT.
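To make the global-to-local decoupling concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation (which is in the linked repository): the dimensions, mask ratio, window size, and the class name GlobalToLocalSketch are all assumptions. It shows the three ingredients the abstract names: a global encoder trained with random frame masking, a short-term local branch, and cross-attention fusing the two.

```python
# Minimal sketch of global-to-local temporal decoupling, assuming PyTorch >= 1.9.
# All hyperparameters (dim, n_heads, mask_ratio, window) are illustrative, not GLoT's.
import torch
import torch.nn as nn

class GlobalToLocalSketch(nn.Module):
    def __init__(self, dim=256, n_heads=8, mask_ratio=0.25, window=3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.window = window  # short-term window for the local branch
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Global branch: models long-term inter-frame correlations.
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.global_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Local branch: models short-term details, then queries the global features.
        self.local_self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):  # feats: (B, T, dim) per-frame backbone features
        B, T, _ = feats.shape
        # Masked Pose and Shape Estimation idea: randomly replace some frames
        # with a learnable mask token so the encoder must infer them from context.
        if self.training:
            mask = torch.rand(B, T, device=feats.device) < self.mask_ratio
            feats = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, T, -1), feats)
        global_feats = self.global_encoder(feats)  # (B, T, dim)
        # Local branch: self-attention within a short window around the mid frame,
        # then cross-attention with the global features (queries = local tokens).
        mid = T // 2
        local = feats[:, mid - self.window // 2 : mid + self.window // 2 + 1]
        local, _ = self.local_self_attn(local, local, local)
        fused, _ = self.cross_attn(local, global_feats, global_feats)
        return global_feats, fused  # decoupled representations for a mesh regressor

# Usage: features from a per-frame backbone, e.g., 16-frame clips of 256-d features.
model = GlobalToLocalSketch()
out_global, out_fused = model(torch.randn(2, 16, 256))
```

The fused output would then feed a regressor (in GLoT, the Hierarchical Spatial Correlation Regressor) that predicts per-frame pose and shape; that component is specific to the paper and is omitted here.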