Heatmap Distribution Matching for Human Pose Estimation
Haoxuan Qu
SUTD
Singapore
haoxuan_qu@mymail.sutd.edu.sg
Li Xu
SUTD
Singapore
li_xu@mymail.sutd.edu.sg
Yujun Cai
NTU
Singapore
yujun001@e.ntu.edu.sg
Lin Geng Foo
SUTD
Singapore
lingeng_foo@mymail.sutd.edu.sg
Jun Liu
SUTD
Singapore
jun_liu@sutd.edu.sg
Abstract
Most recent methods for 2D human pose estimation treat the task as a heatmap estimation problem:
they use a Gaussian-smoothed heatmap as the optimization objective and a pixel-wise loss (e.g.,
MSE) as the loss function. In this paper, we show that when the heatmap prediction is optimized in
this way, the model performance of body joint localization, which is the intrinsic objective of
the task, may not improve consistently during the optimization. To address this problem, we
propose, from a novel perspective, to formulate the optimization of the heatmap prediction as a
distribution matching problem directly between the predicted heatmap and the dot annotation of the
body joint. As a result, our method does not need to construct the Gaussian-smoothed heatmap and
achieves a more consistent improvement in model performance during the optimization of the heatmap
prediction. We show the effectiveness of our method through extensive experiments on the COCO
dataset and the MPII dataset.
1 Introduction
2D human pose estimation aims to locate body joints of a person in a given RGB image. It is relevant
to a variety of applications, such as action recognition [34], human-machine interaction [40], and
sign language understanding [19]. For tackling the task of 2D human pose estimation, most of the
recent methods [29, 20, 33, 26, 14, 4, 39, 35, 15, 17, 37] are heatmap-based, i.e., they regard 2D
human pose estimation as a heatmap estimation problem. Specifically, for each body joint, these
methods generally estimate a grid-like heatmap, on which each pixel value represents the probability
that this pixel contains the body joint. Compared to the methods [30, 2, 32, 13] that directly
regress the coordinates of body joints (i.e., coordinate regression-based methods), the heatmap-based
methods
demonstrate a more robust performance since they maintain the spatial structure of the input image
throughout the encoding and decoding process [7].
Corresponding Author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.00740v2 [cs.CV] 4 Oct 2022
Figure 1: Illustration of heatmaps. Although the pixel-wise MSE loss calculated between the predicted
heatmap #2 and the Gaussian-smoothed heatmap is smaller than the loss calculated between the
predicted heatmap #1 and the Gaussian-smoothed heatmap, the predicted heatmap #2 localizes the
body joint wrongly, whereas the predicted heatmap #1 localizes the body joint correctly.
During the training process of the heatmap-based methods, an important step is the optimization
of the heatmap prediction. This optimization can be done naively by constructing a dot-annotated
heatmap for each body joint as shown in Fig. 1(a), and then measuring the difference (i.e., conducting
a pixel-wise comparison) between the predicted heatmap and the constructed ground-truth (GT)
dot-annotated heatmap. However, such a dot-annotated heatmap is sparse, as every pixel is zero
except the pixel representing the dot annotation of the body joint. Because of this, optimizing the
heatmap prediction in such a naive way can make training difficult and lead to suboptimal model
performance [28]. To address this problem, most heatmap-based methods
[29, 20, 33, 26, 14, 4, 39, 35, 15, 17, 37] construct Gaussian-smoothed heatmaps, where pixels near
the dot annotation have larger pixel values than pixels far from the dot annotation. Specifically,
they construct the Gaussian-smoothed heatmap by smoothing the dot annotation of the body joint
with a Gaussian distribution as shown in Fig. 1(b), instead of only setting the pixel representing
the dot annotation to one.
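
The two kinds of target heatmap described above can be sketched as follows. This is a minimal illustration under assumptions of ours, not the construction used by any particular cited method: the function names, the heatmap size, the joint position, and the unnormalized Gaussian with `sigma=2.0` are all hypothetical choices.

```python
import numpy as np

def dot_heatmap(height, width, joint_xy):
    """Sparse target: 1 at the annotated pixel, 0 everywhere else."""
    heatmap = np.zeros((height, width), dtype=np.float64)
    x, y = joint_xy
    heatmap[y, x] = 1.0
    return heatmap

def gaussian_heatmap(height, width, joint_xy, sigma=2.0):
    """Dense target: unnormalized 2D Gaussian centred on the annotated pixel.

    sigma is a hand-picked assumption; choosing it well is exactly the
    difficulty discussed in the text.
    """
    xs = np.arange(width)[None, :]   # column (x) coordinates
    ys = np.arange(height)[:, None]  # row (y) coordinates
    x, y = joint_xy
    sq_dist = (xs - x) ** 2 + (ys - y) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical 64x48 heatmap with the joint annotated at x=20, y=30.
dot = dot_heatmap(64, 48, (20, 30))
smooth = gaussian_heatmap(64, 48, (20, 30), sigma=2.0)
```

In this sketch `dot` has a single non-zero entry, while `smooth` peaks at the same pixel and decays with distance from it.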
While it eases the training process, constructing the GT Gaussian-smoothed heatmap still introduces
problems into model training. Firstly, constructing the Gaussian-smoothed heatmap requires
choosing a proper standard deviation for the Gaussian distribution. However, the proper standard
deviations (i.e., those that lead to optimal performance) often vary across different types of body
joints, different body postures, and different body sizes [17]. Hence, the standard deviations often
need to be carefully chosen, which is non-trivial. Secondly, while the heatmap prediction is
optimized by minimizing the pixel-wise loss (e.g., MSE) between the predicted heatmap and the
Gaussian-smoothed heatmap, the model performance of body joint localization may not be
consistently improved. As shown in Fig. 1, the pixel-wise MSE loss between the predicted heatmap #2
and the Gaussian-smoothed heatmap is smaller than that between the predicted heatmap #1 and the
Gaussian-smoothed heatmap, yet the predicted heatmap #2 localizes the body joint wrongly, whereas
the predicted heatmap #1 localizes it correctly.
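
A small constructed example can reproduce the situation described above. The heatmaps below are hypothetical toy data of ours, not the ones in Fig. 1: prediction #1 is a low-amplitude copy of the target (correct peak), and prediction #2 matches the target everywhere except for one strong spurious spike far from the joint (wrong peak). Taking the predicted location as the argmax of the heatmap is also an assumption of this sketch.

```python
import numpy as np

def gaussian_target(size, center_xy, sigma):
    """Unnormalized Gaussian-smoothed target on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    sq_dist = (xs - center_xy[0]) ** 2 + (ys - center_xy[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mse(a, b):
    """Pixel-wise mean squared error between two heatmaps."""
    return float(np.mean((a - b) ** 2))

def peak_xy(heatmap):
    """Predicted joint location, taken as the argmax pixel (x, y)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)

joint = (10, 10)
target = gaussian_target(21, joint, sigma=2.0)

pred_1 = 0.3 * target   # right place, but too weak everywhere
pred_2 = target.copy()  # matches the target almost everywhere ...
pred_2[0, 0] = 1.2      # ... except one spurious peak in a far corner

mse_1, mse_2 = mse(pred_1, target), mse(pred_2, target)
peak_1, peak_2 = peak_xy(pred_1), peak_xy(pred_2)

print(f"pred #1: MSE={mse_1:.5f}, peak={peak_1}")
print(f"pred #2: MSE={mse_2:.5f}, peak={peak_2}")
```

In this toy setting prediction #2 has the smaller MSE even though its peak is in the wrong corner, while prediction #1 has the larger MSE and the correct peak.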
As a result, using either the dot-annotated heatmap or the Gaussian-smoothed heatmap as the ground
truth for optimizing the heatmap prediction has its own problems. Hence, in this work, we aim to
tackle these problems, and propose to optimize the heatmap prediction directly by minimizing the
difference between the predicted heatmap and the dot annotation. By doing so, we can optimize the
model directly towards accurately localizing the dot annotation of the body joint, which is the
intrinsic objective of 2D human pose estimation, instead of optimizing the model indirectly towards
either the dot-annotated heatmap or the Gaussian-smoothed heatmap. However, as the number of pixels
in the predicted heatmap and the number of entries representing the dot annotation are different, we
cannot measure the difference between the predicted heatmap and the dot annotation trivially by
measuring their entry-wise difference. To handle this problem, inspired by the fact that the
difference between two distributions can be measured via their Earth Mover’s Distance even if they
have different numbers of entries, we propose to first formulate the optimization of the heatmap
prediction as a distribution matching problem. Specifically, we construct two distributions, one
from the predicted heatmap and one from the dot annotation. After that, we optimize the heatmap
prediction by minimizing the distribution difference based on the Earth Mover’s Distance. By
optimizing the heatmap prediction directly from the dot annotation in this way, we do not need to
construct the Gaussian-smoothed heatmap, and we also avoid the issues of the binary dot-annotated
heatmap. Thus, our method achieves superior performance.
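
To make the idea concrete, here is a minimal sketch of an Earth Mover's Distance between a heatmap and a single dot annotation. It rests on assumptions of ours and should not be read as the loss defined later in the paper: the heatmap is normalized by its sum, the dot annotation is treated as a unit point mass, and the ground cost is the Euclidean pixel distance. Under these assumptions every unit of mass must be moved to the annotated point, so the distance reduces to the expected distance of the heatmap mass from the annotation. The function name `emd_to_dot` is hypothetical.

```python
import numpy as np

def emd_to_dot(heatmap, joint_xy):
    """EMD between a normalized heatmap and a unit point mass at joint_xy.

    Assumptions: non-negative heatmap with positive sum, sum-normalization,
    Euclidean ground cost. joint_xy may be sub-pixel (floats).
    """
    h, w = heatmap.shape
    prob = heatmap / heatmap.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    cost = np.sqrt((xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2)
    # With a single target point the only feasible transport plan sends all
    # mass there, so the EMD is the cost-weighted sum of the mass.
    return float((prob * cost).sum())

# Hypothetical toy heatmaps on a 21x21 grid, joint annotated at (10, 10).
ys, xs = np.mgrid[0:21, 0:21]
bump = np.exp(-((xs - 10) ** 2 + (ys - 10) ** 2) / (2.0 * 2.0 ** 2))

pred_1 = 0.3 * bump   # weak but correctly placed
pred_2 = bump.copy()
pred_2[0, 0] = 1.2    # strong spurious peak in a far corner

emd_1 = emd_to_dot(pred_1, (10, 10))
emd_2 = emd_to_dot(pred_2, (10, 10))
print(f"EMD #1 = {emd_1:.3f}, EMD #2 = {emd_2:.3f}")
```

In this toy case the correctly placed heatmap receives the smaller distance, and the heatmap (441 entries) and the annotation (two coordinates) never need to have the same shape.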
Our proposed method is simple yet effective, and can be easily applied to various off-the-shelf
2D human pose estimation models by replacing their original loss function with our proposed loss
function measuring the distribution difference between the predicted heatmap and the dot annotation.
We evaluate our proposed method on multiple models, and it achieves a consistent model
performance improvement.
The contributions of our work are summarized as follows. 1) We show (in Sec. 4) that the
performance of the human pose estimation model may not be consistently improved during the process
of minimizing the pixel-wise loss between the predicted heatmap and the GT Gaussian-smoothed
heatmap. 2) From a novel perspective, we formulate the optimization of the heatmap prediction as a
distribution matching problem directly between the predicted heatmap and the GT dot annotation,
which bypasses the step of constructing the Gaussian-smoothed heatmap and achieves consistent
model performance improvement. 3) Our proposed method achieves state-of-the-art performance on
the evaluated benchmarks.
2 Related Work
2D Human Pose Estimation. Due to its wide range of applications, the task of 2D human pose
estimation has received a lot of attention
[30, 2, 32, 13, 21, 29, 20, 33, 26, 14, 4, 39, 35, 15, 17, 37, 27, 7, 8]. DeepPose [30] made the
first attempt to apply deep neural networks to the task of 2D human pose estimation by directly
regressing the coordinates of body joints. Coordinate regression-based methods of this type
[30, 2, 32, 13] often show inferior performance compared to the heatmap-based methods
[29, 20, 33, 26, 14, 4, 39, 35, 15, 17, 37], as the heatmap-based methods can preserve the spatial
structure of the input image throughout the encoding and decoding process [7]. Hence, recently, the
great majority of the state-of-the-art methods [29, 20, 33, 26, 14, 4, 39, 35, 15, 17, 37] regard 2D
human pose estimation as a heatmap estimation problem instead of a coordinate regression problem.
Among the heatmap-based methods, Tompson et al. [29] proposed to apply a Markov Random Field (MRF)
to the task of 2D human pose estimation. After that, an "hourglass" network, with a conv-deconv
architecture, was proposed by Newell et al. [20]. Xiao et al. [33] proposed a baseline method that
predicts the heatmap by adding several deconvolutional layers to a backbone network. Later on, to
maintain high-resolution representations throughout the heatmap estimation process, HRNet was
proposed by Sun et al. [26]. Yuan et al. [37] further proposed HRFormer to learn the high-resolution
representations utilizing a transformer-based architecture. Besides the above heatmap-based methods
that use the Gaussian-smoothed heatmap as the optimization objective, there are also some methods
[27, 7, 8] that combine the ideas of heatmap estimation and coordinate regression by taking the
expectation of the predicted heatmap as the predicted coordinates.
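
Taking the expectation of a heatmap as the predicted coordinates can be sketched as follows. This is a generic illustration rather than the exact formulation of [27, 7, 8]: normalizing by the heatmap sum is an assumption of ours (a softmax is another common choice), and the function name and toy heatmap are hypothetical.

```python
import numpy as np

def expected_coordinates(heatmap):
    """Predicted (x, y) as the probability-weighted mean pixel position.

    Assumes a non-negative heatmap with positive sum, normalized by its sum.
    """
    prob = heatmap / heatmap.sum()
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return float((prob * xs).sum()), float((prob * ys).sum())

# Hypothetical 64x48 heatmap: a Gaussian bump centred at x=20, y=30.
ys, xs = np.mgrid[0:64, 0:48]
heatmap = np.exp(-((xs - 20) ** 2 + (ys - 30) ** 2) / (2.0 * 2.0 ** 2))

x_hat, y_hat = expected_coordinates(heatmap)
print(f"expected coordinates: ({x_hat:.3f}, {y_hat:.3f})")
```

Unlike an argmax, this readout is a smooth function of the heatmap values and can return sub-pixel coordinates.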
Here in this work, different from previous works, our method bypasses both the step of regression
and the step of constructing the Gaussian-smoothed heatmap as the optimization objective. Instead,
from a novel perspective, we propose to formulate the optimization of the heatmap prediction as
a distribution matching problem by minimizing the distribution difference between the predicted
heatmap and the dot annotation.
Distribution Matching. The idea of distribution matching has been studied in various tasks
[24, 25, 31, 38, 41, 23], such as image retrieval [24], tracking [25], few-shot learning [38], and
long-tail recognition [23]. In this work, from a novel perspective, we design a new distribution
matching scheme to optimize the heatmap prediction with the help of sub-pixel resolutions for 2D
human pose estimation.