
In this study, we show that common sample-based metrics in lifting, such as the mean per joint position
error, encourage overconfident distributions rather than correct estimates of the true distribution. They
therefore do not guarantee that the estimated density of 3D poses is a faithful representation of
the underlying data distribution and its ambiguities. Consequently, the predicted uncertainty
cannot be trusted in downstream decisions, even though trustworthy uncertainty would be one of the key
benefits of a probabilistic model.
In a series of experiments, we show that a probabilistic lifting model trained with a likelihood objective
provides a higher-quality estimated distribution. First, we evaluate the distributions learned by minimizing
minMPJPE instead of the negative log-likelihood (NLL) and observe that, although the minMPJPE-optimal
distributions have an accurate mean, they are not well-calibrated. Next, we use the SimpleBaseline
(Martinez et al., 2017) lifting model with a simple Gaussian noise model on Human3.6M to
demonstrate that a model optimized for NLL is well-calibrated but underperforms on minMPJPE,
whereas the same model optimized for minMPJPE performs well on that metric but turns out to be
miscalibrated. To balance this trade-off, we propose an interpretable evaluation strategy that allows
comparing sample-based methods while retaining calibration. Finally, we introduce a novel method to
learn the distribution of 3D poses conditioned on the available 2D keypoint positions. To that end,
we propose a Conditional Graph Normalizing Flow (cGNF). Unlike previous methods, cGNF does
not require training a separate model for the prior and the posterior. Thus, our model does not require
an adversarial loss term, as opposed to Wehrbein et al. (2021) and Kolotouros et al. (2021). By
evaluating the cGNF's performance on the Human3.6M dataset (Ionescu et al., 2014), we show
that, in contrast to previous methods, our model is well-calibrated while being close to the state of the
art in terms of overall minMPJPE, and that it significantly outperforms prior work in accuracy on
occluded joints.
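For concreteness, the simple Gaussian noise model and NLL objective mentioned above can be sketched as follows. The PyTorch code is purely illustrative; the variable names and the diagonal, per-joint parameterization are assumptions rather than the exact implementation used in our experiments.

```python
import math
import torch

def gaussian_nll(mu, log_sigma, target):
    """NLL of a diagonal Gaussian noise model over the 3D joint positions.

    mu, log_sigma, target: tensors of shape (batch, K, 3), where K is the
    number of joints; mu and log_sigma are predicted by the lifting network.
    The diagonal parameterization is an illustrative choice.
    """
    var = torch.exp(2.0 * log_sigma)
    nll = 0.5 * (target - mu) ** 2 / var + log_sigma + 0.5 * math.log(2.0 * math.pi)
    # Sum over joints and coordinates, average over the batch.
    return nll.sum(dim=(1, 2)).mean()
```

Minimizing this objective scores the entire predictive density, whereas a sample-based objective such as minMPJPE only rewards the hypothesis closest to the ground truth.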
2 RELATED WORK
Lifting Models Estimating the 3D human pose from a 2D image is an active research area
(Pavlakos et al., 2016; Martinez et al., 2017; Zhao et al., 2019; Wu et al., 2022). An effective
approach is to decouple 2D keypoint detection from 3D pose estimation (Martinez et al., 2017):
the 2D keypoints are first estimated from the image with a 2D keypoint detector, and a lifting
model then uses only these keypoints to obtain a 3D pose estimate. Since estimating a 3D
pose from 2D data is a highly ill-posed problem, several approaches have been proposed to estimate multiple
hypotheses (Li & Lee, 2019; Sharma et al., 2019; Oikarinen et al., 2020; Kolotouros et al.,
2021; Li et al., 2021; Wehrbein et al., 2021). However, these approaches i) do not explicitly account
for occluded or missing keypoints and ii) do not consider the calibration of the estimated densities.
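As an illustration of this decoupled setup, a lifting model can be as simple as a multilayer perceptron that operates on the detected keypoints alone. The following PyTorch sketch is schematic; the layer sizes and the 17-joint skeleton are chosen for illustration and do not reproduce the exact SimpleBaseline architecture of Martinez et al. (2017).

```python
import torch
import torch.nn as nn

K = 17  # number of joints; 17 is common for Human3.6M-style skeletons

# Schematic lifting model: maps the 2D keypoints of one frame directly to a
# 3D pose, without access to the image.
lifting_model = nn.Sequential(
    nn.Linear(K * 2, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, K * 3),
)

keypoints_2d = torch.randn(1, K * 2)   # output of a 2D keypoint detector
pose_3d = lifting_model(keypoints_2d).reshape(1, K, 3)
```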
Wehrbein et al. (2021) incorporate a Normalizing Flow (Tabak, 2000) architecture to model the well-defined
3D-to-2D projection and exploit the invertible nature of Normalizing Flows to obtain 2D-to-3D
estimates. Although structured as a Normalizing Flow, the model is not trained as a probabilistic model.
Instead, the authors optimize it by minimizing a set of cost functions, all of which depend in some
form on the distance of the hypotheses to the ground truth. In addition, they utilize an adversarial
loss to improve the quality of the hypotheses. The proposed model achieves high performance on
popular metrics in multi-hypothesis pose estimation, which are all sample-based distance measures
rather than distribution-based metrics. Sharma et al. (2019) introduce a conditional variational
autoencoder architecture with an ordinal ranking to disambiguate depth. Similarly to Wehrbein et al.
(2021), the authors additionally optimize the poses on sample-based reconstruction metrics and report
performance on sample-based metrics only.
Sample-Based Metrics in Pose Estimation The most widely used metric in pose estimation is
the mean per joint position error (MPJPE) (Wang et al., 2021). It is defined as the mean Euclidean
distance between the $K$ ground-truth joint positions $X \in \mathbb{R}^{K \times 3}$ and the predicted joint positions
$\hat{X} \in \mathbb{R}^{K \times 3}$. Multi-hypothesis pose estimation considers $N$ hypotheses of positions $\hat{X} \in \mathbb{R}^{N \times K \times 3}$
and adapts the error to consider the hypothesis closest to the ground truth (Jahangiri & Yuille, 2017):
$$\operatorname{minMPJPE}(\hat{X}, X) = \min_{n} \frac{1}{K} \sum_{k=1}^{K} \left\lVert \hat{X}_{n,k} - X_{k} \right\rVert_{2}.$$
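A direct NumPy transcription of this definition reads as follows (the array and function names are ours, chosen for illustration):

```python
import numpy as np

def min_mpjpe(hypotheses: np.ndarray, target: np.ndarray) -> float:
    """minMPJPE as defined above.

    hypotheses: (N, K, 3) array of N pose hypotheses.
    target:     (K, 3) array of ground-truth joint positions.
    """
    # Per-hypothesis MPJPE: mean Euclidean distance over the K joints.
    per_joint_dist = np.linalg.norm(hypotheses - target[None], axis=-1)  # (N, K)
    mpjpe_per_hypothesis = per_joint_dist.mean(axis=-1)                  # (N,)
    # Keep only the hypothesis closest to the ground truth.
    return float(mpjpe_per_hypothesis.min())
```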
In this work, we refer to this minimum version of the MPJPE as minMPJPE. The percentage of
correct keypoints (PCK) (Toshev & Szegedy, 2013; Tompson et al., 2014; Mehta et al., 2016) is