Preprint. Under review.
MULTI-HYPOTHESIS 3D HUMAN POSE ESTIMATION
METRICS FAVOR MISCALIBRATED DISTRIBUTIONS
Paweł A. Pierzchlewicz1,2, R. James Cotton3,4, Mohammad Bashiri1,2, Fabian H. Sinz1,2,5,6
1Institute for Bioinformatics and Medical Informatics, Tübingen University, Tübingen, Germany
2Department of Computer Science, Göttingen University, Göttingen, Germany
3Shirley Ryan AbilityLab, Chicago, IL, USA
4Department of Physical Medicine and Rehabilitation, Northwestern University, Evanston, IL, USA
5Department of Neuroscience, Baylor College of Medicine, Houston, TX, USA
6Center for Neuroscience and Artificial Intelligence, Baylor College of Medicine, Houston, TX, USA
{ppierzc,bashiri,sinz}@cs.uni-goettingen.de
rcotton@sralab.org
ABSTRACT
Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-
posed problem. Well-calibrated distributions of possible poses can make these
ambiguities explicit and preserve the resulting uncertainty for downstream tasks.
This study shows that previous attempts, which account for these ambiguities via
multiple hypotheses generation, produce miscalibrated distributions. We identify
that miscalibration can be attributed to the use of sample-based metrics such as
minMPJPE. In a series of simulations, we show that minimizing minMPJPE,
as commonly done, should converge to the correct mean prediction. However,
it fails to correctly capture the uncertainty, thus resulting in a miscalibrated dis-
tribution. To mitigate this problem, we propose an accurate and well-calibrated
model called Conditional Graph Normalizing Flow (cGNF). Our model is struc-
tured such that a single cGNF can estimate both conditional and marginal den-
sities within the same model – effectively solving a zero-shot density estimation
problem. We evaluate cGNF on the Human 3.6M dataset and show that cGNF
provides a well-calibrated distribution estimate while being close to state-of-the-
art in terms of overall minMPJPE. Furthermore, cGNF outperforms previous
methods on occluded joints while remaining well-calibrated.¹
1 INTRODUCTION
The task of estimating the 3D human pose from 2D images is a classical problem in computer
vision and has received significant attention over the years (Agarwal & Triggs, 2004; Mori & Malik,
2006; Bo et al., 2008). With the advent of deep learning, various approaches have been applied
to this problem with many of them achieving impressive results (Martinez et al., 2017; Pavlakos
et al., 2016; 2018; Zhao et al., 2019; Zou & Tang, 2021). However, the task of 3D pose estimation
from 2D images is highly ill-posed: A single 2D joint can often be associated with multiple 3D
positions, and due to occlusions, many joints can be entirely missing from the image. While many
previous studies still estimate one single solution for each image (Martinez et al., 2017; Pavlakos
et al., 2017; Sun et al., 2017; Zhao et al., 2019; Zhang et al., 2021), some attempts have been made
to generate multiple hypotheses to account for these ambiguities (Li & Lee, 2019; Sharma et al.,
2019; Wehrbein et al., 2021; Oikarinen et al., 2020; Li & Lee, 2020). Many of these approaches rely
on estimating the conditional distribution of 3D poses given the 2D observation implicitly through
sample-based methods. Since direct likelihood estimation in sample-based methods is usually not
feasible, different sample-based evaluation metrics have become popular. As a result, the field’s
focus has been on the quality of individual samples with respect to the ground truth and not the
quality of the probability distribution of 3D poses itself.
¹Code and pretrained model weights are available at https://github.com/sinzlab/cGNF.
In this study, we show that common sample-based metrics in lifting, such as mean per joint position
error, encourage overconfident distributions rather than correct estimates of the true distribution. As
a result, they do not guarantee that the estimated density of 3D poses is a faithful representation of the underlying data distribution and its ambiguities. Consequently, the predicted uncertainty cannot be trusted in downstream decisions, forfeiting one of the key benefits of a probabilistic model.
In a series of experiments, we show that a probabilistic lifting model trained with a likelihood objective provides a higher-quality estimate of the distribution. First, we evaluate the distributions learned by minimizing minMPJPE instead of the negative log-likelihood (NLL), observing that, although the minMPJPE-optimal distributions have a good mean, they are not well-calibrated. Next, we use the SimpleBase-
line (Martinez et al., 2017) lifting model with a simple Gaussian noise model on Human3.6M to
demonstrate that a model optimized for NLL is well-calibrated but underperforms on minMPJPE.
The same model optimized for minMPJPE performs well in that metric but turns out to be miscal-
ibrated. To balance this trade-off, we propose an interpretable evaluation strategy that allows com-
paring sample-based methods, while retaining calibration. Finally, we introduce a novel method to
learn the distribution of 3D poses conditioned on the available 2D keypoint positions. To that end,
we propose a Conditional Graph Normalizing Flow (cGNF). Unlike previous methods, cGNF does
not require training a separate model for the prior and posterior. Thus, our model does not require
an adversarial loss term, as opposed to Wehrbein et al. (2021) and Kolotouros et al. (2021). By
evaluating the cGNF’s performance on the Human 3.6M dataset (Ionescu et al., 2014), we show
that, in contrast to previous methods, our model is well calibrated while being close to state-of-the-
art in terms of overall minMPJPE, and that it significantly outperforms prior work in accuracy on
occluded joints.
2 RELATED WORK
Lifting Models Estimating the 3D human pose from a 2D image is an active research area
(Pavlakos et al., 2016; Martinez et al., 2017; Zhao et al., 2019; Wu et al., 2022). An effective
approach is to decouple 2D keypoint detection from 3D pose estimation (Martinez et al., 2017).
First, the 2D keypoints are estimated from the image using a 2D keypoint detector, then a lifting
model uses just these keypoints to obtain a 3D pose estimate. Since the task of estimating a 3D
pose from 2D data is a highly ill-posed problem, approaches have been proposed to estimate mul-
tiple hypotheses (Li & Lee, 2019; Sharma et al., 2019; Oikarinen et al., 2020; Kolotouros et al.,
2021; Li et al., 2021; Wehrbein et al., 2021). However, these approaches i) do not explicitly account
for occluded or missing keypoints and ii) do not consider the calibration of the estimated densities.
Wehrbein et al. (2021) incorporate a Normalizing Flow (Tabak, 2000) architecture to model the well-
defined 3D to 2D projection and exploit the invertible nature of Normalizing Flows to obtain 2D to
3D estimates. Albeit structured as a Normalizing Flow, it is not trained as a probabilistic model. Instead, the authors optimize the model by minimizing a set of cost functions, all of which depend in some form on the distance of the hypotheses to the ground truth. In addition, they utilize an adversarial
loss to improve the quality of the hypotheses. The proposed model achieves high performance on
popular metrics in multi-hypothesis pose estimation, which are all sample-based distance measures
rather than distribution-based metrics. Sharma et al. (2019) introduce a conditional variational au-
toencoder architecture with an ordinal ranking to disambiguate depth. Similarly to Wehrbein et al.
(2021), the authors additionally optimize the poses on sample-based reconstruction metrics and re-
port performance on sample-based metrics only.
Sample-Based Metrics in Pose Estimation The most widely used metric in pose estimation is
the mean per joint position error (MPJPE) (Wang et al., 2021). It is defined as the mean Euclidean distance between the $K$ ground-truth joint positions $X \in \mathbb{R}^{K \times 3}$ and the predicted joint positions $\hat{X} \in \mathbb{R}^{K \times 3}$. Multi-hypothesis pose estimation considers $N$ hypotheses $\hat{X} \in \mathbb{R}^{N \times K \times 3}$ and adapts the error to consider the hypothesis closest to the ground truth (Jahangiri & Yuille, 2017):
$$\operatorname{minMPJPE}(\hat{X}, X) = \min_{n} \frac{1}{K} \sum_{k=1}^{K} \left\lVert \hat{X}_{n,k} - X_{k} \right\rVert_{2}$$
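For concreteness, a minimal NumPy sketch of this metric for a single example (array shapes follow the notation above; the function name and the assumption that poses are given in millimetres are ours):

```python
import numpy as np

def min_mpjpe(hypotheses: np.ndarray, gt: np.ndarray) -> float:
    """minMPJPE for one example.

    hypotheses: (N, K, 3) array of N pose hypotheses with K joints.
    gt:         (K, 3) array of ground-truth joint positions.
    """
    # Euclidean distance of every joint in every hypothesis to the ground truth: (N, K)
    per_joint_error = np.linalg.norm(hypotheses - gt[None], axis=-1)
    # MPJPE of each hypothesis (mean over joints), then keep the closest hypothesis
    return float(per_joint_error.mean(axis=-1).min())
```

Over a dataset, the metric is typically averaged across examples, e.g. `np.mean([min_mpjpe(h, x) for h, x in zip(hypotheses_list, gt_list)])`.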
In this work, we refer to this minimum version of the MPJPE as minMPJPE. The percentage of
correct keypoints (PCK) (Toshev & Szegedy, 2013; Tompson et al., 2014; Mehta et al., 2016) is
another widely accepted metric in pose estimation, which measures the percentage of keypoints within a radius of 150 mm around the ground truth, in terms of minMPJPE. Finally, the correct pose score (CPS) proposed by Wandt et al. (2021) considers a pose to be correct if all of its keypoints are within a radius $r \in [0\,\mathrm{mm}, 300\,\mathrm{mm}]$ of the ground truth, in terms of minMPJPE. CPS is defined as the area under the curve of the percentage of correct poses as a function of $r$.
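The two threshold-based metrics can be sketched in the same way. The sketch below scores the closest (minMPJPE) hypothesis, which is one reasonable reading of "in terms of minMPJPE"; the thresholds are in millimetres and the 301-point radius grid is an arbitrary choice:

```python
import numpy as np

def pck(hypotheses: np.ndarray, gt: np.ndarray, threshold: float = 150.0) -> float:
    """Fraction of keypoints of the closest hypothesis within `threshold` mm of the ground truth."""
    per_joint_error = np.linalg.norm(hypotheses - gt[None], axis=-1)  # (N, K)
    best = per_joint_error.mean(axis=-1).argmin()                     # closest hypothesis
    return float((per_joint_error[best] < threshold).mean())

def cps(hypotheses_list, gt_list, radii=np.linspace(0.0, 300.0, 301)) -> float:
    """Correct Pose Score: area under the curve of the fraction of poses whose
    closest hypothesis has *all* joints within radius r, over r in [0, 300] mm."""
    fraction_correct = np.zeros_like(radii)
    for hyp, gt in zip(hypotheses_list, gt_list):
        per_joint_error = np.linalg.norm(hyp - gt[None], axis=-1)
        best = per_joint_error.mean(axis=-1).argmin()
        # A pose is "correct" at radius r if its worst joint error is below r
        fraction_correct += per_joint_error[best].max() <= radii
    fraction_correct /= len(gt_list)
    return float(np.trapz(fraction_correct, radii))
```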
Calibration Calibration is an important property of a probabilistic model. It refers to the ability of
a model to correctly reflect the uncertainty in the data. Thus, the confidence of an event assigned by
a well-calibrated model should be equal to the true probability of the event. Humans have a natural
cognitive intuition for probabilities (Cosmides & Tooby, 1996) and good confidence estimates can
provide valuable information to the user, especially in safety-critical applications. Therefore, density
calibration has been an important topic in the machine learning community. Guo et al. (2017) show
that calibration of densities has become especially important in the field of deep learning, where
large models have been shown to be miscalibrated. Brier (1950) introduced the Brier score as a
metric to measure the quality of a forecast. It is defined as the expected squared difference between
the predicted probability $\hat{p} \in \mathbb{R}^{N}$ and the true probability $p \in \mathbb{R}^{N}$ of $N$ samples. Naeini et al. (2015) propose the expected calibration error (ECE) metric, which approximates the expectation of the absolute difference between the predicted probability and the true probability:
$$\mathrm{ECE} = \frac{1}{N} \sum_{n=1}^{N} \left| \hat{p}_{n} - p_{n} \right| \qquad (1)$$
The lower the ECE the better the calibration of the distribution. A model which predicts the same
probability for all samples has an ECE of 0.5, whereas a perfectly calibrated model has ECE = 0.
Reliability diagrams (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005) provide a visual representation of calibration. They display the calibration curve, which plots the model's confidence against the true probability. If the calibration curve is the identity function, the model is perfectly calibrated.
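Equation 1 translates directly into code; a minimal sketch (the function name is ours, and the predicted and true probabilities are assumed to be given as paired arrays, e.g. the quantiles q and the frequencies ω(q) defined in Section 3.1):

```python
import numpy as np

def expected_calibration_error(p_hat: np.ndarray, p_true: np.ndarray) -> float:
    """ECE as in equation 1: mean absolute difference between predicted
    and true probabilities, both of shape (N,)."""
    return float(np.mean(np.abs(p_hat - p_true)))
```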
3 OBSERVING MISCALIBRATION
In this section, we demonstrate that the current state-of-the-art lifting models are not well calibrated.
We consider two of the latest methods: Sharma et al. (2019) and Wehrbein et al. (2021). We compute
the ECE for the two models and visualize their reliability diagrams (Fig. 1a).
3.1 CALIBRATION FOR POSE ESTIMATION
We adapt the quantile calibration definition introduced in Song et al. (2019) for pose estimation
problems. For $M$ 3D ground-truth poses with $K$ keypoints, $X \in \mathbb{R}^{M \times K \times 3}$, we generate $N = 200$ hypotheses $\hat{X} \in \mathbb{R}^{N \times M \times K \times 3}$ from the learned model $q(X \mid C)$ given the 2D poses $C \in \mathbb{R}^{M \times K \times 2}$. We compute the per-dimension median position $\tilde{X} \in \mathbb{R}^{M \times K \times 3}$ of the hypotheses. Next, for each ground-truth example $m$ and keypoint $k$ we compute the $L_2$ distance of each hypothesis $\hat{X}_{n,m,k}$ from the median, $\varepsilon_{n,m,k} = \lVert \hat{X}_{n,m,k} - \tilde{X}_{m,k} \rVert_2$, to obtain a univariate distribution of errors. Using $\varepsilon_{n,m,k}$ we obtain an empirical estimate of the cumulative distribution function $\Phi_m(\varepsilon)$. Given the distances $\varepsilon^{*}_{m,k}$ of the ground truth $X_{m,k}$ from the median $\tilde{X}_{m,k}$, we compute the frequency $\omega_k(q)$ of $\varepsilon^{*}_{:,k}$ falling into a particular quantile $q \in [0, 1]$:
$$\omega_k(q) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\!\left[\Phi_m(\varepsilon^{*}_{m,k}) \leq q\right],$$
Finally, we consider the median curve $\omega(q)$ across the $K$ keypoints. An ideally calibrated model would result in $\omega(q) = q$. In this case, the error between the median estimate and the ground truth would be consistent with the spread predicted by the inferred distribution. With this formulation, we can compute the ECE according to equation 1. We report the calibration curves $\omega(q)$ and ECEs for each model in Fig. 1a.
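As a concrete reference, a NumPy sketch of this procedure (shapes follow the notation above; whether the empirical CDF is kept per keypoint or pooled across keypoints is ambiguous in the text, so the per-keypoint variant below is an assumption, as is the quantile grid):

```python
import numpy as np

def calibration_curve(hypotheses: np.ndarray, gt: np.ndarray,
                      quantiles: np.ndarray = np.linspace(0.0, 1.0, 21)) -> np.ndarray:
    """Quantile calibration for pose hypotheses.

    hypotheses: (N, M, K, 3) -- N hypotheses for M examples with K keypoints.
    gt:         (M, K, 3)    -- ground-truth 3D poses.
    Returns omega of shape (len(quantiles), K).
    """
    # Per-dimension median of the hypotheses: (M, K, 3)
    median = np.median(hypotheses, axis=0)
    # Errors of the hypotheses and of the ground truth w.r.t. the median
    eps = np.linalg.norm(hypotheses - median[None], axis=-1)  # (N, M, K)
    eps_star = np.linalg.norm(gt - median, axis=-1)           # (M, K)
    # Empirical CDF value of the ground-truth error under the hypothesis errors
    phi = (eps <= eps_star[None]).mean(axis=0)                # (M, K)
    # omega_k(q): fraction of examples whose ground-truth error lies within quantile q
    return (phi[None] <= quantiles[:, None, None]).mean(axis=1)

# ECE of the median curve across keypoints (cf. equation 1):
# omega = calibration_curve(hypotheses, gt)
# ece = np.abs(np.median(omega, axis=-1) - np.linspace(0.0, 1.0, 21)).mean()
```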
[Figure 1, panels a-d; panel (a) legend: Sharma et al. (2019), ECE = 0.36; Wehrbein et al. (2021), ECE = 0.18; reference curve: perfectly calibrated.]
Figure 1: a) Calibration curves of previous lifting models with the corresponding expected calibration error (ECE) scores. b) Standard deviation σ of a Gaussian distribution optimized to minimize minMPJPE for different numbers of samples and dimensions. The true σ is 0.5 (black line); underconfident σ > 0.5 (blue), overconfident σ < 0.5 (pink). The human-pose-equivalent distribution (black point, 45 dimensions, 200 samples) is compared to an oracle distribution (with true µ and σ) in terms of minMPJPE and NLL. c) Gaussian noise model schematic to the left. The SimpleBaseline model weights are not trained. The bar plots to the right compare the performance on minMPJPE and ECE when optimizing for minMPJPE and NLL. d) Loss landscapes of minMPJPE and ECE for a 1D Gaussian distribution with parameters σ and µ. The gold star represents the ground-truth values of σ = 4 and µ = 0. To the right is a schematic of the ECE-constrained optimization.
3.2 SAMPLE-BASED METRICS PROMOTE MISCALIBRATION
Here, we show that sample-based metrics are a major component that contributes to miscalibration.
In principle, minMPJPE could be a good surrogate metric for the NLL. However, as it has become a common metric for selecting models, it might become subject to Goodhart's Law (Goodhart, 1975) – "When a measure becomes a target, it ceases to be a good measure" (Strathern, 1997). In the case
of minimizing the mean MPJPE over hypotheses, the posterior distribution collapses onto the mean
(sup. A.1). Similarly, simulations indicate that minMPJPE converges to the correct mean, but it
encourages miscalibration (Fig. 1b,d and A.2).
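The collapse can be sketched in one line (a simplified version of the argument, not the derivation in sup. A.1; the distinction between the mean and the spatial median is glossed over here). Because the expected mean-over-hypotheses MPJPE decomposes into independent terms,
$$\min_{\hat{X}_1,\ldots,\hat{X}_N} \; \mathbb{E}_{X \sim p(X \mid C)}\!\left[\frac{1}{N}\sum_{n=1}^{N}\operatorname{MPJPE}(\hat{X}_n, X)\right] \;=\; \frac{1}{N}\sum_{n=1}^{N}\,\min_{\hat{X}_n}\;\mathbb{E}_{X \sim p(X \mid C)}\!\left[\operatorname{MPJPE}(\hat{X}_n, X)\right],$$
every hypothesis is driven to the same minimizer (the per-joint spatial median of $p(X \mid C)$, which coincides with the mean for symmetric distributions), i.e. the optimal set of hypotheses is a point mass rather than a sample from the posterior.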
We illustrate this with a small toy example. Consider $M$ samples $X \in \mathbb{R}^{M \times D}$ from a $D$-dimensional isotropic Normal distribution with mean $\mu^{*} \in \mathbb{R}^{D}$ and variance $\sigma^{*2} \in \mathbb{R}^{D}$, and an approximate isotropic Normal posterior distribution $q(X)$ with mean $\mu \in \mathbb{R}^{D}$ and variance $\sigma^{2} \in \mathbb{R}^{D}$. We assume the ground-truth mean to be known, $\mu = \mu^{*}$, and only optimize the variance $\sigma^{2}$ to minimize minMPJPE with $N$ hypotheses. We optimize $\sigma^{2}$ for different numbers of dimensions $D$ and hypotheses $N$. Intuitively, for a small sampling budget, drawing samples at the mean
constitutes the least risk of generating a bad sample. With an increase in the number of hypotheses, increasing the variance should gradually become beneficial, as the samples cover more of the volume. For a sufficiently large number of hypotheses, we can expect the variance to increase beyond the true variance, as the low-probability samples can have sufficient representation. Increasing the number of dimensions should have the inverse effect, since the volume to be covered increases with each dimension. We observe these effects in the toy example (Fig. 1b).
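A hedged sketch of this simulation (grid search instead of gradient-based optimization, a single D-dimensional "joint", and placeholder sample sizes, so it only illustrates the trend, not the exact protocol behind Fig. 1b):

```python
import numpy as np

def expected_min_mpjpe(sigma: float, d: int, n_hyp: int, sigma_true: float = 0.5,
                       n_trials: int = 1000, rng=np.random.default_rng(0)) -> float:
    """Monte Carlo estimate of E[minMPJPE] when the data come from N(0, sigma_true^2 I_d)
    and the hypotheses are drawn from N(0, sigma^2 I_d) (mean fixed to the true mean).
    The pose is collapsed into a single d-dimensional point, so MPJPE reduces to the
    Euclidean distance."""
    gt = rng.normal(0.0, sigma_true, size=(n_trials, 1, d))   # ground-truth samples
    hyp = rng.normal(0.0, sigma, size=(n_trials, n_hyp, d))   # hypotheses
    dist = np.linalg.norm(hyp - gt, axis=-1)                  # (n_trials, n_hyp)
    return float(dist.min(axis=-1).mean())                    # best hypothesis per trial, averaged

# The "human pose" setting from the text: D = 45 dimensions, N = 200 hypotheses.
sigmas = np.linspace(0.05, 1.5, 30)
scores = [expected_min_mpjpe(s, d=45, n_hyp=200) for s in sigmas]
print(f"minMPJPE-optimal sigma ~ {sigmas[int(np.argmin(scores))]:.2f} (true sigma = 0.5)")
```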
When we consider the case which corresponds to the 3D pose estimation problem (D = 45 and N = 200, black point in Fig. 1b), we expect an overconfident distribution based on our toy example. This is also what we observe for the current state-of-the-art lifting models (Fig. 1a). Furthermore, we show that the minMPJPE optimal