Heatmap Distribution Matching for Human Pose Estimation

Haoxuan Qu

SUTD

Singapore

haoxuan_qu@mymail.sutd.edu.sg

Li Xu

SUTD

Singapore

li_xu@mymail.sutd.edu.sg

Yujun Cai

NTU

Singapore

yujun001@e.ntu.edu.sg

Lin Geng Foo

SUTD

Singapore

lingeng_foo@mymail.sutd.edu.sg

Jun Liu ∗

SUTD

Singapore

jun_liu@sutd.edu.sg

Abstract

To tackle 2D human pose estimation, the great majority of recent methods regard the task as a heatmap estimation problem, and optimize the heatmap prediction using the Gaussian-smoothed heatmap as the optimization objective and a pixel-wise loss (e.g., MSE) as the loss function. In this paper, we show that when the heatmap prediction is optimized in this way, the model's body joint localization performance, which is the intrinsic objective of this task, may not improve consistently during the optimization process. To address this problem, from a novel perspective, we propose to formulate the optimization of the heatmap prediction as a distribution matching problem between the predicted heatmap and the dot annotation of the body joint directly. By doing so, our proposed method does not need to construct the Gaussian-smoothed heatmap and can achieve a more consistent model performance improvement during the optimization of the heatmap prediction. We show the effectiveness of our proposed method through extensive experiments on the COCO dataset and the MPII dataset.

1 Introduction

2D human pose estimation aims to locate body joints of a person in a given RGB image. It is relevant to a variety of applications, such as action recognition [ ], human-machine interaction [ ], and sign language understanding [ ]. For tackling the task of 2D human pose estimation, most of the recent methods [ ] are heatmap-based, i.e., they regard 2D human pose estimation as a heatmap estimation problem. Specifically, for each body joint, these methods generally estimate a grid-like heatmap, on which each pixel value represents the probability that this pixel contains the body joint. Compared to the methods [ ] that directly regress the coordinates of body joints (i.e., coordinate regression-based methods), the heatmap-based methods demonstrate a more robust performance since they maintain the spatial structure of the input image throughout the encoding and decoding process [7].

∗Corresponding Author

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.00740v2 [cs.CV] 4 Oct 2022

Figure 1: Illustration of heatmaps. Although the pixel-wise MSE loss calculated between the predicted heatmap #2 and the Gaussian-smoothed heatmap is smaller than the loss calculated between the predicted heatmap #1 and the Gaussian-smoothed heatmap, the predicted heatmap #2 localizes the body joint wrongly, whereas the predicted heatmap #1 localizes the body joint correctly.

During the training process of the heatmap-based methods, an important step is the optimization of the heatmap prediction. This optimization can be done naively via constructing a dot-annotated heatmap for each body joint as shown in Fig. 1(a), and then measuring the difference (i.e., conducting pixel-wise comparison) between the predicted heatmap and the constructed ground-truth (GT) dot-annotated heatmap. However, such a dot-annotated heatmap is sparse, as it has the same zero value for all pixels except the pixel representing the dot annotation of the body joint. Because of this, optimizing the heatmap prediction in such a naive way can lead to a hard training process and a suboptimal model performance [ ]. To address this problem, most heatmap-based methods [ ] adopt a strategy to construct Gaussian-smoothed heatmaps, where pixels near the dot annotation have larger pixel values than pixels far from the dot annotation. Specifically, they construct the Gaussian-smoothed heatmap via smoothing the dot annotation of the body joint through a Gaussian distribution as shown in Fig. 1(b), instead of only setting the pixel representing the dot annotation to be one.
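The two kinds of ground-truth heatmap described above can be sketched as follows. This is a minimal illustration, not the construction used by any particular method: the helper names `dot_heatmap` and `gaussian_heatmap`, the 5x5 grid, and the unnormalized Gaussian with peak value one are assumptions made for the example.

```python
import math

def dot_heatmap(height, width, joint_xy):
    """Sparse target: 1 at the annotated pixel, 0 everywhere else."""
    jx, jy = joint_xy
    return [[1.0 if (x, y) == (jx, jy) else 0.0 for x in range(width)]
            for y in range(height)]

def gaussian_heatmap(height, width, joint_xy, sigma):
    """Smoothed target: unnormalized Gaussian centered on the annotation."""
    jx, jy = joint_xy
    return [[math.exp(-((x - jx) ** 2 + (y - jy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

# Hypothetical joint at pixel (x=2, y=2) of a 5x5 heatmap.
dot = dot_heatmap(5, 5, (2, 2))
smooth = gaussian_heatmap(5, 5, (2, 2), sigma=1.0)

# Count how many pixels of the dot-annotated heatmap carry any signal.
nonzero_dot = sum(v > 0 for row in dot for v in row)
```

Only one of the 25 entries of `dot` is non-zero, whereas `smooth` decays gradually with distance from the annotation; the value of `sigma` controls how quickly it decays.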

While easing the training process, constructing the GT Gaussian-smoothed heatmap still brings problems into the model training process. Firstly, for constructing the Gaussian-smoothed heatmap, we need to choose a proper standard deviation of the Gaussian distributions. However, the proper standard deviations of the Gaussian distributions (i.e., the standard deviations that can lead to an optimal performance) often vary across different types of body joints, different body postures, and different body sizes [ ]. Hence, the standard deviations of the Gaussian distributions often need to be carefully chosen, which is non-trivial. Secondly, during the process of optimizing the heatmap prediction by minimizing the pixel-wise loss (e.g., MSE) between the predicted heatmap and the Gaussian-smoothed heatmap, the model performance of body joint localization may not be consistently improved. As shown in Fig. 1, although the pixel-wise MSE loss calculated between the predicted heatmap #2 and the Gaussian-smoothed heatmap is smaller than the loss calculated between the predicted heatmap #1 and the Gaussian-smoothed heatmap, the predicted heatmap #2 localizes the body joint wrongly, whereas the predicted heatmap #1 localizes the body joint correctly.
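The second problem can be reproduced with a toy one-dimensional example. The values below are invented for illustration and are not the heatmaps of Fig. 1; `mse` and `argmax` are hypothetical helpers, and the joint is assumed to sit at index 3 of a 7-pixel row.

```python
import math

def mse(pred, target):
    """Pixel-wise mean squared error between two equally sized rows."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def argmax(values):
    """Index of the largest value, i.e. the predicted joint location."""
    return max(range(len(values)), key=lambda i: values[i])

joint = 3  # assumed ground-truth pixel index
# Gaussian-smoothed target with sigma = 1, peaking at the joint.
target = [math.exp(-((i - joint) ** 2) / 2.0) for i in range(7)]

# Prediction #1: weak and narrow response, but the peak is at the joint.
pred_1 = [0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0]
# Prediction #2: close to the Gaussian shape, but the peak is one pixel off.
pred_2 = [0.01, 0.13, 0.60, 0.90, 0.95, 0.13, 0.01]

mse_1, mse_2 = mse(pred_1, target), mse(pred_2, target)
loc_1, loc_2 = argmax(pred_1), argmax(pred_2)
```

In this toy setting `mse_2` is smaller than `mse_1`, yet only `loc_1` coincides with the joint, so a lower pixel-wise loss does not by itself imply better localization.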

As a result, optimizing the heatmap prediction using either the dot-annotated heatmap or the Gaussian-smoothed heatmap as the ground truth has its own problems. Hence, in this work, we aim to tackle these problems, and propose to optimize the heatmap prediction directly via minimizing the difference between the predicted heatmap and the dot annotation. By doing so, we can optimize the model directly towards accurately localizing the dot annotation of the body joint, which is the intrinsic objective of 2D human pose estimation, instead of optimizing the model indirectly towards either the dot-annotated heatmap or the Gaussian-smoothed heatmap. However, as the number of pixels in the predicted heatmap and the number of entries representing the dot annotation are different, we cannot measure the difference between the predicted heatmap and the dot annotation trivially by measuring their entry-wise difference. To handle this problem, inspired by the fact that the difference between two distributions can be measured via their Earth Mover's Distance even if they have different numbers of entries, we propose to first formulate the optimization of the heatmap prediction as a distribution matching problem. Specifically, we construct two distributions respectively from the predicted heatmap and the dot annotation. After that, we optimize the heatmap prediction via minimizing the distribution difference based on the Earth Mover's Distance. Using such a novel method to optimize the heatmap prediction directly from the dot annotation, we do not need to construct the Gaussian-smoothed heatmap, and we also avoid the issues of the binary dot-annotated heatmap. Thus, our method achieves superior performance.
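A simplified sketch of the distribution matching idea follows. It assumes that the predicted heatmap is turned into a distribution by dividing by its sum and that the ground cost is the Euclidean distance between pixel coordinates and the annotation; the exact construction and loss used in this paper are defined in later sections and may differ. When the target is a single point mass, all predicted mass must be transported to that point, so the Earth Mover's Distance reduces to the expected distance to the annotation. The helper names `normalize` and `emd_to_dot` are hypothetical.

```python
import math

def normalize(heatmap):
    """Turn a non-negative heatmap into a probability distribution."""
    total = sum(sum(row) for row in heatmap)
    return [[v / total for v in row] for row in heatmap]

def emd_to_dot(heatmap, joint_xy):
    """EMD between a normalized heatmap and a unit point mass at joint_xy.

    With a single-point target the transport plan is forced: every pixel
    sends all of its mass to the annotation, so the cost is the
    probability-weighted mean distance (Euclidean cost is an assumption).
    """
    jx, jy = joint_xy
    probs = normalize(heatmap)
    return sum(p * math.hypot(x - jx, y - jy)
               for y, row in enumerate(probs)
               for x, p in enumerate(row))

joint = (1.5, 1.0)  # assumed (x, y) annotation, not tied to the pixel grid

# Mass concentrated on the two pixels adjacent to the annotation.
near = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
# Mass concentrated in two opposite corners, far from the annotation.
far = [[1, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1]]

emd_near = emd_to_dot(near, joint)
emd_far = emd_to_dot(far, joint)
```

Note that the annotation enters only through its coordinates, so no target heatmap of matching size has to be built, and the heatmap with mass closer to the annotation receives the smaller distance.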

Our proposed method is simple yet effective, and can be easily applied to various off-the-shelf 2D human pose estimation models by replacing their original loss function with our proposed loss function measuring the distribution difference between the predicted heatmap and the dot annotation. We evaluate our proposed method on multiple models, and it achieves a consistent improvement in model performance.

The contributions of our work are summarized as follows. 1) We show through analysis (in Sec. 4) that the performance of the human pose estimation model may not be consistently improved during the process of minimizing the pixel-wise loss between the predicted heatmap and the GT Gaussian-smoothed heatmap. 2) From a novel perspective, we formulate the optimization of the heatmap prediction as a distribution matching problem between the predicted heatmap and the GT dot annotation directly, which bypasses the step of constructing the Gaussian-smoothed heatmap and achieves consistent model performance improvement. 3) Our proposed method achieves state-of-the-art performance on the evaluated benchmarks.

2 Related Work

2D Human Pose Estimation. Due to the wide range of applications, the task of 2D human pose estimation has received lots of attention [ ]. DeepPose [ ] made the first attempt to apply deep neural networks to the task of 2D human pose estimation via directly regressing the coordinates of body joints. Coordinate regression-based methods of this type [ ] often show inferior performance compared to the heatmap-based methods [ ], as the heatmap-based methods can preserve the spatial structure of the input image throughout the encoding and decoding process [ ]. Hence, recently, the great majority of the state-of-the-art methods [ ] regard 2D human pose estimation as a heatmap estimation problem instead of a coordinate regression problem. Among the heatmap-based methods, Tompson et al. [ ] proposed to apply a Markov Random Field (MRF) to the task of 2D human pose estimation. After that, an "hourglass" network, with a conv-deconv architecture, was proposed by Newell et al. [ ]. Xiao et al. [ ] proposed a baseline method to predict the heatmap via adding several deconvolutional layers to a backbone network. Later on, to maintain high-resolution representations throughout the heatmap estimation process, HRNet was proposed by Sun et al. [ ]. Yuan et al. [ ] further proposed HRFormer to learn the high-resolution representations utilizing a transformer-based architecture. Besides the above heatmap-based methods that use the Gaussian-smoothed heatmap as the optimization objective, there are also some methods [ ] that combine the idea of heatmap and coordinate regression by taking the expectation of the predicted heatmap as the predicted coordinates.
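Taking the expectation of a heatmap as the predicted coordinates can be sketched as below. Normalizing by the heatmap sum (rather than, e.g., a softmax) and the helper name `expected_coordinates` are assumptions made for illustration, not details of any cited method.

```python
def expected_coordinates(heatmap):
    """Return the (x, y) expectation of a non-negative 2D heatmap."""
    total = sum(sum(row) for row in heatmap)
    # Weight each column index x by the mass found at that column.
    ex = sum(v * x for row in heatmap for x, v in enumerate(row)) / total
    # Weight each row index y by the mass found in that row.
    ey = sum(v * y for y, row in enumerate(heatmap) for v in row) / total
    return ex, ey

# Equal mass on pixels (x=1, y=1) and (x=2, y=1).
heatmap = [[0.0, 0.0, 0.0],
           [0.0, 2.0, 2.0],
           [0.0, 0.0, 0.0]]
coords = expected_coordinates(heatmap)
```

Here `coords` lies halfway between the two active pixels, which shows how an expectation can yield a sub-pixel location even though the heatmap itself is defined on an integer grid.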

Here in this work, different from previous works, our method bypasses both the step of regression and the step of constructing the Gaussian-smoothed heatmap as the optimization objective. Instead, from a novel perspective, we propose to formulate the optimization of the heatmap prediction as a distribution matching problem by minimizing the distribution difference between the predicted heatmap and the dot annotation.

Distribution Matching. The idea of distribution matching has been studied in various tasks [ ], such as image retrieval [ ], tracking [ ], few-shot learning [ ], and long-tail recognition [ ]. In this work, from a novel perspective, we design a new distribution matching scheme to optimize the heatmap prediction with the help of sub-pixel resolutions for 2D human pose estimation.
