Towards Good Practices for Deep 3D Hand Pose Estimation

3D hand pose estimation from a single depth image is an important and challenging problem for human-computer interaction. Recently, deep convolutional networks (ConvNets) with sophisticated designs have been employed to address it, but their improvement over traditional random-forest-based methods is not so apparent. To identify good practices and improve the performance of hand pose estimation, we propose a tree-structured Region Ensemble Network (REN) for direct 3D coordinate regression. It first partitions the outputs of the last convolutional layer of the ConvNet into several grid regions. The results from separate fully-connected (FC) regressors on each region are then integrated by another FC layer to perform the estimation. By exploiting several training strategies, including data augmentation and the smooth $L_1$ loss, the proposed REN significantly improves the ability of the ConvNet to localize hand joints. The experimental results demonstrate that our approach achieves the best performance among state-of-the-art algorithms on three public hand pose datasets. We also evaluate our method on fingertip detection and human pose datasets and obtain state-of-the-art accuracy.
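The region-ensemble idea and the smooth $L_1$ loss described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. All function names are hypothetical, and so are the single-channel feature map, the 2 x 2 grid, the ReLU on each branch, and the random weights. The threshold of 1 in the loss is the commonly used form of smooth $L_1$ and is assumed here, since the abstract does not state it.

```python
import random


def smooth_l1(pred, target):
    """Mean smooth L1 loss: quadratic when |d| < 1, linear otherwise.

    The threshold of 1 is an assumption (the commonly used form).
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(pred)


def partition_grid(fmap, rows, cols):
    """Split an H x W feature map (list of lists) into rows*cols flattened regions."""
    h, w = len(fmap), len(fmap[0])
    rh, rw = h // rows, w // cols
    regions = []
    for r in range(rows):
        for c in range(cols):
            regions.append([fmap[y][x]
                            for y in range(r * rh, (r + 1) * rh)
                            for x in range(c * rw, (c + 1) * rw)])
    return regions


def fc(x, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_fc(n_in, n_out, rng):
    """Hypothetical helper: small random weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def region_ensemble_forward(fmap, rows, cols, branch_params, fusion_params):
    """One FC regressor per grid region, then a fusion FC over the concatenation."""
    fused = []
    for region, (w, b) in zip(partition_grid(fmap, rows, cols), branch_params):
        # ReLU on each branch output is an assumption of this sketch.
        fused.extend(max(0.0, v) for v in fc(region, w, b))
    w, b = fusion_params
    return fc(fused, w, b)


# Toy usage: a 4 x 4 feature map, a 2 x 2 grid, and 2 joints x 3 coordinates.
rng = random.Random(0)
fmap = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
branches = [random_fc(4, 3, rng) for _ in range(4)]
fusion = random_fc(12, 6, rng)
pose = region_ensemble_forward(fmap, 2, 2, branches, fusion)
loss = smooth_l1(pose, [0.0] * 6)
```

In this sketch each branch sees only its own grid cell, so the fusion layer is the only place where information from different regions is combined. That is the tree structure the abstract refers to.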