Preprint. Under review.
MULTI-HYPOTHESIS 3D HUMAN POSE ESTIMATION
METRICS FAVOR MISCALIBRATED DISTRIBUTIONS
Paweł A. Pierzchlewicz1,2, R. James Cotton3,4, Mohammad Bashiri1,2, Fabian H. Sinz1,2,5,6
1Institute for Bioinformatics and Medical Informatics, Tübingen University, Tübingen, Germany
2Department of Computer Science, Göttingen University, Göttingen, Germany
3Shirley Ryan AbilityLab, Chicago, IL, USA
4Department of Physical Medicine and Rehabilitation, Northwestern University, Evanston, IL, USA
5Department of Neuroscience, Baylor College of Medicine, Houston, TX, USA
6Center for Neuroscience and Artificial Intelligence, Baylor College of Medicine, Houston, TX, USA
{ppierzc,bashiri,sinz}@cs.uni-goettingen.de
rcotton@sralab.org
ABSTRACT
Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-
posed problem. Well-calibrated distributions of possible poses can make these
ambiguities explicit and preserve the resulting uncertainty for downstream tasks.
This study shows that previous attempts, which account for these ambiguities via
multiple hypotheses generation, produce miscalibrated distributions. We identify
that miscalibration can be attributed to the use of sample-based metrics such as
minMPJPE. In a series of simulations, we show that minimizing minMPJPE,
as commonly done, should converge to the correct mean prediction. However,
it fails to correctly capture the uncertainty, thus resulting in a miscalibrated dis-
tribution. To mitigate this problem, we propose an accurate and well-calibrated
model called Conditional Graph Normalizing Flow (cGNF). Our model is struc-
tured such that a single cGNF can estimate both conditional and marginal den-
sities within the same model – effectively solving a zero-shot density estimation
problem. We evaluate cGNF on the Human 3.6M dataset and show that cGNF
provides a well-calibrated distribution estimate while being close to state-of-the-
art in terms of overall minMPJPE. Furthermore, cGNF outperforms previous
methods on occluded joints while remaining well-calibrated.¹
1 INTRODUCTION
The task of estimating the 3D human pose from 2D images is a classical problem in computer
vision and has received significant attention over the years (Agarwal & Triggs, 2004; Mori & Malik,
2006; Bo et al., 2008). With the advent of deep learning, various approaches have been applied
to this problem with many of them achieving impressive results (Martinez et al., 2017; Pavlakos
et al., 2016; 2018; Zhao et al., 2019; Zou & Tang, 2021). However, the task of 3D pose estimation
from 2D images is highly ill-posed: A single 2D joint can often be associated with multiple 3D
positions, and due to occlusions, many joints can be entirely missing from the image. While many
previous studies still estimate one single solution for each image (Martinez et al., 2017; Pavlakos
et al., 2017; Sun et al., 2017; Zhao et al., 2019; Zhang et al., 2021), some attempts have been made
to generate multiple hypotheses to account for these ambiguities (Li & Lee, 2019; Sharma et al.,
2019; Wehrbein et al., 2021; Oikarinen et al., 2020; Li & Lee, 2020). Many of these approaches rely
on estimating the conditional distribution of 3D poses given the 2D observation implicitly through
sample-based methods. Since direct likelihood estimation in sample-based methods is usually not
feasible, different sample-based evaluation metrics have become popular. As a result, the field’s
focus has been on the quality of individual samples with respect to the ground truth and not the
quality of the probability distribution of 3D poses itself.
¹Code and pretrained model weights are available at https://github.com/sinzlab/cGNF.
In this study, we show that common sample-based metrics in lifting, such as mean per joint position
error, encourage overconfident distributions rather than correct estimates of the true distribution. As
a result, they do not guarantee that the estimated density of 3D poses is a faithful representation of the underlying data distribution and its ambiguities. Consequently, the predicted uncertainty cannot be trusted in downstream decisions, forfeiting one of the key benefits of a probabilistic model.
In a series of experiments, we show that a probabilistic lifting model trained with a likelihood objective provides a higher-quality estimate of the distribution. First, we evaluate the distributions learned by minimizing minMPJPE instead of the negative log-likelihood (NLL), observing that, although the minMPJPE-optimal distributions have a good mean, they are not well-calibrated. Next, we use the SimpleBase-
line (Martinez et al., 2017) lifting model with a simple Gaussian noise model on Human3.6M to
demonstrate that a model optimized for NLL is well-calibrated but underperforms on minMPJPE.
The same model optimized for minMPJPE performs well in that metric but turns out to be miscal-
ibrated. To balance this trade-off, we propose an interpretable evaluation strategy that allows com-
paring sample-based methods, while retaining calibration. Finally, we introduce a novel method to
learn the distribution of 3D poses conditioned on the available 2D keypoint positions. To that end,
we propose a Conditional Graph Normalizing Flow (cGNF). Unlike previous methods, cGNF does
not require training a separate model for the prior and posterior. Thus, our model does not require
an adversarial loss term, as opposed to Wehrbein et al. (2021) and Kolotouros et al. (2021). By
evaluating the cGNF’s performance on the Human 3.6M dataset (Ionescu et al., 2014), we show
that, in contrast to previous methods, our model is well calibrated while being close to state-of-the-
art in terms of overall minMPJPE, and that it significantly outperforms prior work in accuracy on
occluded joints.
2 RELATED WORK
Lifting Models Estimating the 3D human pose from a 2D image is an active research area
(Pavlakos et al., 2016; Martinez et al., 2017; Zhao et al., 2019; Wu et al., 2022). An effective
approach is to decouple 2D keypoint detection from 3D pose estimation (Martinez et al., 2017).
First, the 2D keypoints are estimated from the image using a 2D keypoint detector, then a lifting
model uses just these keypoints to obtain a 3D pose estimate. Since the task of estimating a 3D
pose from 2D data is a highly ill-posed problem, approaches have been proposed to estimate mul-
tiple hypotheses (Li & Lee, 2019; Sharma et al., 2019; Oikarinen et al., 2020; Kolotouros et al.,
2021; Li et al., 2021; Wehrbein et al., 2021). However, these approaches i) do not explicitly account
for occluded or missing keypoints and ii) do not consider the calibration of the estimated densities.
Wehrbein et al. (2021) incorporate a Normalizing Flow (Tabak, 2000) architecture to model the well-
defined 3D to 2D projection and exploit the invertible nature of Normalizing Flows to obtain 2D to
3D estimates. Albeit structured as a Normalizing Flow, it is not trained as a probabilistic model. Instead, the authors optimize the model by minimizing a set of cost functions, all of which depend in some form on the distance of the hypotheses to the ground truth. In addition, they utilize an adversarial
loss to improve the quality of the hypotheses. The proposed model achieves high performance on
popular metrics in multi-hypothesis pose estimation, which are all sample-based distance measures
rather than distribution-based metrics. Sharma et al. (2019) introduce a conditional variational au-
toencoder architecture with an ordinal ranking to disambiguate depth. Similarly to Wehrbein et al.
(2021), the authors additionally optimize the poses on sample-based reconstruction metrics and re-
port performance on sample-based metrics only.
Sample-Based Metrics in Pose Estimation The most widely used metric in pose estimation is
the mean per joint position error (MPJPE) (Wang et al., 2021). It is defined as the mean Euclidean distance between the $K$ ground-truth joint positions $X \in \mathbb{R}^{K \times 3}$ and the predicted joint positions $\hat{X} \in \mathbb{R}^{K \times 3}$. Multi-hypothesis pose estimation considers $N$ hypotheses $\hat{X} \in \mathbb{R}^{N \times K \times 3}$ and adapts the error to consider the hypothesis closest to the ground truth (Jahangiri & Yuille, 2017):
$$\operatorname{minMPJPE}(\hat{X}, X) = \min_{n} \frac{1}{K} \sum_{k=1}^{K} \left\lVert \hat{X}_{n,k} - X_{k} \right\rVert_{2}$$
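For concreteness, a minimal NumPy sketch of this metric for a single example (array shapes follow the notation above; the function name and the assumption that poses are given in millimetres are ours):

```python
import numpy as np

def min_mpjpe(hypotheses: np.ndarray, gt: np.ndarray) -> float:
    """minMPJPE for one example.

    hypotheses: (N, K, 3) array of N pose hypotheses with K joints.
    gt:         (K, 3) array of ground-truth joint positions.
    """
    # Euclidean distance of every joint in every hypothesis to the ground truth: (N, K)
    per_joint_error = np.linalg.norm(hypotheses - gt[None], axis=-1)
    # MPJPE of each hypothesis (mean over joints), then keep the closest hypothesis
    return float(per_joint_error.mean(axis=-1).min())
```

Over a dataset, the metric is typically averaged across examples, e.g. `np.mean([min_mpjpe(h, x) for h, x in zip(hypotheses_list, gt_list)])`.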
In this work, we refer to this minimum version of the MPJPE as minMPJPE. The percentage of
correct keypoints (PCK) (Toshev & Szegedy, 2013; Tompson et al., 2014; Mehta et al., 2016) is
another widely accepted metric in pose estimation, which measures the percentage of keypoints within a radius of 150 mm around the ground truth, in terms of minMPJPE. Finally, the correct pose score (CPS) proposed by Wandt et al. (2021) considers a pose to be correct if all of its keypoints are within a radius $r \in [0\,\mathrm{mm}, 300\,\mathrm{mm}]$ of the ground truth, in terms of minMPJPE. CPS is defined as the area under the curve of the percentage of correct poses as a function of $r$.
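The two threshold-based metrics can be sketched in the same way. The sketch below scores the closest (minMPJPE) hypothesis, which is one reasonable reading of "in terms of minMPJPE"; the thresholds are in millimetres and the 301-point radius grid is an arbitrary choice:

```python
import numpy as np

def pck(hypotheses: np.ndarray, gt: np.ndarray, threshold: float = 150.0) -> float:
    """Fraction of keypoints of the closest hypothesis within `threshold` mm of the ground truth."""
    per_joint_error = np.linalg.norm(hypotheses - gt[None], axis=-1)  # (N, K)
    best = per_joint_error.mean(axis=-1).argmin()                     # closest hypothesis
    return float((per_joint_error[best] < threshold).mean())

def cps(hypotheses_list, gt_list, radii=np.linspace(0.0, 300.0, 301)) -> float:
    """Correct Pose Score: area under the curve of the fraction of poses whose
    closest hypothesis has *all* joints within radius r, over r in [0, 300] mm."""
    fraction_correct = np.zeros_like(radii)
    for hyp, gt in zip(hypotheses_list, gt_list):
        per_joint_error = np.linalg.norm(hyp - gt[None], axis=-1)
        best = per_joint_error.mean(axis=-1).argmin()
        # A pose is "correct" at radius r if its worst joint error is below r
        fraction_correct += per_joint_error[best].max() <= radii
    fraction_correct /= len(gt_list)
    return float(np.trapz(fraction_correct, radii))
```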
Calibration Calibration is an important property of a probabilistic model. It refers to the ability of
a model to correctly reflect the uncertainty in the data. Thus, the confidence of an event assigned by
a well-calibrated model should be equal to the true probability of the event. Humans have a natural
cognitive intuition for probabilities (Cosmides & Tooby, 1996) and good confidence estimates can
provide valuable information to the user, especially in safety-critical applications. Therefore, density
calibration has been an important topic in the machine learning community. Guo et al. (2017) show
that calibration of densities has become especially important in the field of deep learning, where
large models have been shown to be miscalibrated. Brier (1950) introduced the Brier score as a
metric to measure the quality of a forecast. It is defined as the expected squared difference between
the predicted probability $\hat{p} \in \mathbb{R}^{N}$ and the true probability $p \in \mathbb{R}^{N}$ of $N$ samples. Naeini et al. (2015) propose the expected calibration error (ECE) metric, which approximates the expectation of the absolute difference between the predicted probability and the true probability:
$$\mathrm{ECE} = \frac{1}{N} \sum_{n=1}^{N} \left| \hat{p}_{n} - p_{n} \right| \qquad (1)$$
The lower the ECE the better the calibration of the distribution. A model which predicts the same
probability for all samples has an ECE of 0.5, whereas a perfectly calibrated model has ECE = 0.
Reliability diagrams (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005) provide a visual representation of calibration. They display the calibration curve, which plots the model's confidence against the true probability. If the calibration curve is the identity function, the model is perfectly calibrated.
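Equation 1 translates directly into code; a minimal sketch (the function name is ours, and the predicted and true probabilities are assumed to be given as paired arrays, e.g. the quantiles q and the frequencies ω(q) defined in Section 3.1):

```python
import numpy as np

def expected_calibration_error(p_hat: np.ndarray, p_true: np.ndarray) -> float:
    """ECE as in equation 1: mean absolute difference between predicted
    and true probabilities, both of shape (N,)."""
    return float(np.mean(np.abs(p_hat - p_true)))
```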
3 OBSERVING MISCALIBRATION
In this section, we demonstrate that the current state-of-the-art lifting models are not well calibrated.
We consider two of the latest methods: Sharma et al. (2019) and Wehrbein et al. (2021). We compute
the ECE for the two models and visualize their reliability diagrams (Fig. 1a).
3.1 CALIBRATION FOR POSE ESTIMATION
We adapt the quantile calibration definition introduced in Song et al. (2019) for pose estimation
problems. For $M$ 3D ground-truth poses with $K$ keypoints, $X \in \mathbb{R}^{M \times K \times 3}$, we generate $N = 200$ hypotheses $\hat{X} \in \mathbb{R}^{N \times M \times K \times 3}$ from the learned model $q(X \mid C)$ given the 2D poses $C \in \mathbb{R}^{M \times K \times 2}$. We compute the per-dimension median position $\tilde{X} \in \mathbb{R}^{M \times K \times 3}$ of the hypotheses. Next, for each ground-truth example $m$ and keypoint $k$ we compute the $L_2$ distance of each hypothesis $\hat{X}_{n,m,k}$ from the median, $\varepsilon_{n,m,k} = \lVert \hat{X}_{n,m,k} - \tilde{X}_{m,k} \rVert_2$, to obtain a univariate distribution of errors. Using $\varepsilon_{n,m,k}$ we obtain an empirical estimate of the cumulative distribution function $\Phi_m(\varepsilon)$. Given the distances $\varepsilon^{*}_{m,k}$ of the ground truth $X_{m,k}$ from the median $\tilde{X}_{m,k}$, we compute the frequency $\omega_k(q)$ of $\varepsilon^{*}_{:,k}$ falling into a particular quantile $q \in [0, 1]$:
$$\omega_k(q) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\!\left[\Phi_m(\varepsilon^{*}_{m,k}) \leq q\right],$$
Finally, we consider the median curve $\omega(q)$ across the $K$ keypoints. An ideally calibrated model would result in $\omega(q) = q$. In this case, the error between the median estimate and the ground truth would be consistent with the spread predicted by the inferred distribution. With this formulation, we can compute the ECE according to equation 1. We report the calibration curves $\omega(q)$ and ECEs for each model in Fig. 1a.
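As a concrete reference, a NumPy sketch of this procedure (shapes follow the notation above; whether the empirical CDF is kept per keypoint or pooled across keypoints is ambiguous in the text, so the per-keypoint variant below is an assumption, as is the quantile grid):

```python
import numpy as np

def calibration_curve(hypotheses: np.ndarray, gt: np.ndarray,
                      quantiles: np.ndarray = np.linspace(0.0, 1.0, 21)) -> np.ndarray:
    """Quantile calibration for pose hypotheses.

    hypotheses: (N, M, K, 3) -- N hypotheses for M examples with K keypoints.
    gt:         (M, K, 3)    -- ground-truth 3D poses.
    Returns omega of shape (len(quantiles), K).
    """
    # Per-dimension median of the hypotheses: (M, K, 3)
    median = np.median(hypotheses, axis=0)
    # Errors of the hypotheses and of the ground truth w.r.t. the median
    eps = np.linalg.norm(hypotheses - median[None], axis=-1)  # (N, M, K)
    eps_star = np.linalg.norm(gt - median, axis=-1)           # (M, K)
    # Empirical CDF value of the ground-truth error under the hypothesis errors
    phi = (eps <= eps_star[None]).mean(axis=0)                # (M, K)
    # omega_k(q): fraction of examples whose ground-truth error lies within quantile q
    return (phi[None] <= quantiles[:, None, None]).mean(axis=1)

# ECE of the median curve across keypoints (cf. equation 1):
# omega = calibration_curve(hypotheses, gt)
# ece = np.abs(np.median(omega, axis=-1) - np.linspace(0.0, 1.0, 21)).mean()
```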
[Figure 1, panels a-d; panel (a) legend: Sharma et al. (2019), ECE = 0.36; Wehrbein et al. (2021), ECE = 0.18; reference curve: perfectly calibrated.]
Figure 1: a) Calibration curves of previous lifting models with the corresponding expected calibration error (ECE) scores. b) Standard deviation σ of a Gaussian distribution optimized to minimize minMPJPE for different numbers of samples and dimensions. The true σ is 0.5 (black line); underconfident σ > 0.5 (blue), overconfident σ < 0.5 (pink). The human-pose-equivalent distribution (black point, 45 dimensions, 200 samples) is compared to an oracle distribution (with true µ and σ) in terms of minMPJPE and NLL. c) Gaussian noise model schematic to the left. The SimpleBaseline model weights are not trained. The bar plots to the right compare the performance on minMPJPE and ECE when optimizing for minMPJPE and NLL. d) Loss landscapes of minMPJPE and ECE for a 1D Gaussian distribution with parameters σ and µ. The gold star represents the ground-truth values of σ = 4 and µ = 0. To the right is a schematic of the ECE-constrained optimization.
3.2 SAMPLE-BASED METRICS PROMOTE MISCALIBRATION
Here, we show that sample-based metrics are a major component that contributes to miscalibration.
In principle, minMPJPE could be a good surrogate metric for the NLL. However, as it has become a common metric for selecting models, it might become subject to Goodhart's Law (Goodhart, 1975) – "When a measure becomes a target, it ceases to be a good measure" (Strathern, 1997). In the case
of minimizing the mean MPJPE over hypotheses, the posterior distribution collapses onto the mean
(sup. A.1). Similarly, simulations indicate that minMPJPE converges to the correct mean, but it
encourages miscalibration (Fig. 1b,d and A.2).
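The collapse can be sketched in one line (a simplified version of the argument, not the derivation in sup. A.1; the distinction between the mean and the spatial median is glossed over here). Because the expected mean-over-hypotheses MPJPE decomposes into independent terms,
$$\min_{\hat{X}_1,\ldots,\hat{X}_N} \; \mathbb{E}_{X \sim p(X \mid C)}\!\left[\frac{1}{N}\sum_{n=1}^{N}\operatorname{MPJPE}(\hat{X}_n, X)\right] \;=\; \frac{1}{N}\sum_{n=1}^{N}\,\min_{\hat{X}_n}\;\mathbb{E}_{X \sim p(X \mid C)}\!\left[\operatorname{MPJPE}(\hat{X}_n, X)\right],$$
every hypothesis is driven to the same minimizer (the per-joint spatial median of $p(X \mid C)$, which coincides with the mean for symmetric distributions), i.e. the optimal set of hypotheses is a point mass rather than a sample from the posterior.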
We illustrate this with a small toy example. Consider $M$ samples $X \in \mathbb{R}^{M \times D}$ from a $D$-dimensional isotropic Normal distribution with mean $\mu^{*} \in \mathbb{R}^{D}$ and variance $\sigma^{*2} \in \mathbb{R}^{D}$, and an approximate isotropic Normal posterior distribution $q(X)$ with mean $\mu \in \mathbb{R}^{D}$ and variance $\sigma^{2} \in \mathbb{R}^{D}$. We assume the ground-truth mean to be known, $\mu = \mu^{*}$, and only optimize the variance $\sigma^{2}$ to minimize minMPJPE with $N$ hypotheses. We optimize $\sigma^{2}$ for different numbers of dimensions $D$ and hypotheses $N$. Intuitively, for a small sampling budget, drawing samples at the mean
constitutes the least risk of generating a bad sample. With an increase in the number of hypotheses, increasing the variance should gradually become beneficial, as the samples cover more of the volume. For a sufficiently large number of hypotheses, we can expect the variance to increase beyond the true variance, as the low-probability samples can have sufficient representation. Increasing the number of dimensions should have the inverse effect, since the volume to be covered increases with each dimension. We observe these effects in the toy example (Fig. 1b).
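A hedged sketch of this simulation (grid search instead of gradient-based optimization, a single D-dimensional "joint", and placeholder sample sizes, so it only illustrates the trend, not the exact protocol behind Fig. 1b):

```python
import numpy as np

def expected_min_mpjpe(sigma: float, d: int, n_hyp: int, sigma_true: float = 0.5,
                       n_trials: int = 1000, rng=np.random.default_rng(0)) -> float:
    """Monte Carlo estimate of E[minMPJPE] when the data come from N(0, sigma_true^2 I_d)
    and the hypotheses are drawn from N(0, sigma^2 I_d) (mean fixed to the true mean).
    The pose is collapsed into a single d-dimensional point, so MPJPE reduces to the
    Euclidean distance."""
    gt = rng.normal(0.0, sigma_true, size=(n_trials, 1, d))   # ground-truth samples
    hyp = rng.normal(0.0, sigma, size=(n_trials, n_hyp, d))   # hypotheses
    dist = np.linalg.norm(hyp - gt, axis=-1)                  # (n_trials, n_hyp)
    return float(dist.min(axis=-1).mean())                    # best hypothesis per trial, averaged

# The "human pose" setting from the text: D = 45 dimensions, N = 200 hypotheses.
sigmas = np.linspace(0.05, 1.5, 30)
scores = [expected_min_mpjpe(s, d=45, n_hyp=200) for s in sigmas]
print(f"minMPJPE-optimal sigma ~ {sigmas[int(np.argmin(scores))]:.2f} (true sigma = 0.5)")
```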
When we consider the case which corresponds to the 3D pose estimation problem (D = 45 and N = 200, black point in Fig. 1b), we expect an overconfident distribution based on our toy example. This is also what we observe for the current state-of-the-art lifting models (Fig. 1a). Furthermore, we show that the minMPJPE optimal