Abstract

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on this flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, and MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
