Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation

Pose estimation plays a critical role in human-centered vision applications. However, it is difficult to deploy state-of-the-art HRNet-based pose estimation models on resource-constrained edge devices due to the high computational cost (more than 150 GMACs per frame). In this paper, we study efficient architecture design for real-time multi-person pose estimation on edge devices. Through gradual shrinking experiments, we reveal that HRNet's high-resolution branches are redundant for models in the low-computation region. Removing them improves both efficiency and performance. Inspired by this finding, we design LitePose, an efficient single-branch architecture for pose estimation, and introduce two simple approaches to enhance its capacity: Fusion Deconv Head and Large Kernel Convs. Fusion Deconv Head removes the redundancy in high-resolution branches, allowing scale-aware feature fusion with low overhead. Large Kernel Convs significantly improve the model's capacity and receptive field while maintaining a low computational cost. With only a 25% increase in computation, 7x7 kernels outperform 3x3 kernels by 14.0 mAP on the CrowdPose dataset. On mobile platforms, LitePose reduces latency by up to 5.0x without sacrificing performance compared with prior state-of-the-art efficient pose estimation models, pushing the frontier of real-time multi-person pose estimation on edge devices. Our code and pre-trained models are released at https://github.com/mit-han-lab/litepose.
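
To give a rough intuition for why a larger kernel can cost relatively little, the following sketch counts multiply-accumulate operations (MACs) for a single block. It assumes a depthwise-separable inverted-bottleneck block; this block structure, the function names, and the channel and resolution values are illustrative assumptions and are not taken from the abstract, so the printed ratios are not expected to reproduce the reported 25% figure.

```python
# Hypothetical MAC-count sketch. The block layout (1x1 expand, kxk depthwise,
# 1x1 project) and all sizes below are assumptions made for illustration only.


def inverted_bottleneck_macs(h, w, channels, expansion, kernel):
    """MACs of a 1x1 expand -> kxk depthwise -> 1x1 project block (stride 1)."""
    hidden = channels * expansion
    expand = h * w * channels * hidden            # 1x1 pointwise expansion
    depthwise = h * w * hidden * kernel * kernel  # one kxk filter per channel
    project = h * w * hidden * channels           # 1x1 pointwise projection
    return expand + depthwise + project


def standard_conv_macs(h, w, channels, kernel):
    """MACs of a dense kxk convolution with equal input and output channels."""
    return h * w * channels * channels * kernel * kernel


if __name__ == "__main__":
    h, w, c, t = 64, 64, 64, 6  # assumed feature-map size, channels, expansion
    block_ratio = (inverted_bottleneck_macs(h, w, c, t, 7)
                   / inverted_bottleneck_macs(h, w, c, t, 3))
    dense_ratio = standard_conv_macs(h, w, c, 7) / standard_conv_macs(h, w, c, 3)
    print(f"7x7 vs 3x3, depthwise block: {block_ratio:.2f}x MACs")
    print(f"7x7 vs 3x3, dense conv:      {dense_ratio:.2f}x MACs")
```

Under these assumptions, only the depthwise term scales with the kernel area, while the two pointwise terms are unaffected; in a dense convolution the whole cost scales with the kernel area.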