Multi-hypothesis 3D human pose estimation metrics favor miscalibrated distributions

Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-posed problem. Well-calibrated distributions of possible poses can make these ambiguities explicit and preserve the resulting uncertainty for downstream tasks. This study shows that previous attempts, which account for these ambiguities by generating multiple hypotheses, produce miscalibrated distributions. We identify that this miscalibration can be attributed to the use of sample-based metrics such as minMPJPE. In a series of simulations, we show that minimizing minMPJPE, as commonly done, should converge to the correct mean prediction. However, it fails to correctly capture the uncertainty, thus resulting in a miscalibrated distribution. To mitigate this problem, we propose an accurate and well-calibrated model called Conditional Graph Normalizing Flow (cGNF). Our model is structured such that a single cGNF can estimate both conditional and marginal densities within the same model, effectively solving a zero-shot density estimation problem. We evaluate cGNF on the Human3.6M dataset and show that cGNF provides a well-calibrated distribution estimate while being close to state-of-the-art in terms of overall minMPJPE. Furthermore, cGNF outperforms previous methods on occluded joints while remaining well-calibrated.
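For readers unfamiliar with the metric, minMPJPE is commonly computed as the lowest mean per-joint position error among a set of sampled hypotheses. The following is a minimal NumPy sketch under that assumed definition; the function names `mpjpe` and `min_mpjpe`, the 17-joint skeleton, and the toy offsets are illustrative choices, not taken from the paper.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error for poses of shape (..., J, 3)."""
    # Euclidean distance per joint, then average over the joint axis.
    return np.linalg.norm(pred - target, axis=-1).mean(axis=-1)

def min_mpjpe(hypotheses, target):
    """Lowest MPJPE among K hypotheses of shape (K, J, 3)."""
    # Broadcast the single ground-truth pose against all K hypotheses.
    return mpjpe(hypotheses, target[None]).min()

# Toy data (made up for illustration): 17 joints, 3 hypotheses that are
# constant offsets of the ground-truth pose.
target = np.zeros((17, 3))
hypotheses = np.stack([target + 3.0, target + 1.0, target - 2.0])
best = min_mpjpe(hypotheses, target)
```

Because only the best hypothesis enters the score, the value is unchanged by how far the remaining hypotheses lie from the ground truth, so the metric on its own says nothing about whether the spread of the samples matches the true uncertainty.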