annotation are different, we cannot measure the difference between the predicted heatmap and the
dot annotation trivially by measuring their entry-wise difference. To handle this problem, inspired
by the fact that we can measure the difference between two distributions via measuring their Earth
Mover’s Distance even if they have different numbers of entries, in this paper, we propose to first
formulate the optimization of the heatmap prediction as a distribution matching problem. Specifically,
we construct two distributions respectively from the predicted heatmap and the dot annotation. After
that, we optimize the heatmap prediction via minimizing the distribution difference based on the
Earth Mover’s Distance. By optimizing the heatmap prediction directly from the dot annotation in
this way, we neither need to construct the Gaussian-smoothed heatmap nor incur the issues of
the binary dot-annotated heatmap. Thus, our method achieves superior performance.
Our proposed method is simple yet effective, and can be easily applied to various off-the-shelf
2D human pose estimation models by replacing their original loss function with our proposed loss
function, which measures the distribution difference between the predicted heatmap and the dot
annotation. We evaluate our proposed method on multiple models, and it consistently improves
model performance.
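As a rough illustration of measuring the difference between a predicted heatmap and a dot annotation with the Earth Mover’s Distance, the sketch below makes several simplifying assumptions that are not taken from the paper: a single joint, a heatmap normalized to unit mass, a Euclidean ground cost between pixel centers, and the dot annotation modeled as a unit point mass at a (possibly sub-pixel) location. Under these assumptions every unit of mass has to travel to the dot, so the optimal transport cost reduces to the mass-weighted mean distance. The names `emd_to_dot` and `dot_xy` are hypothetical, and the construction of the two distributions used in the proposed method may differ.

```python
import numpy as np


def emd_to_dot(heatmap, dot_xy):
    """EMD between a normalized heatmap and a unit point mass at dot_xy.

    Assumption: the target is a single point, so the only feasible transport
    plan moves all mass to that point and the cost is the mean distance.
    """
    h = np.clip(np.asarray(heatmap, dtype=np.float64), 0.0, None)
    total = h.sum()
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    p = h / total  # predicted distribution over pixels

    # Pixel-center coordinates: rows index y, cols index x.
    rows, cols = np.indices(h.shape)
    dist = np.hypot(cols - dot_xy[0], rows - dot_xy[1])
    return float((p * dist).sum())


# All predicted mass sits on pixel (x=2, y=1); the dot lies half a pixel to the right.
heatmap = np.zeros((4, 6))
heatmap[1, 2] = 1.0
print(emd_to_dot(heatmap, (2.5, 1.0)))
```

Note that the heatmap and the dot annotation have different numbers of entries here (24 pixels versus one point), yet the distance is still well defined, which is the property the text relies on.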
The contributions of our work are summarized as follows. 1) We show through analysis (in Sec. 4) that the
performance of the human pose estimation model may not be consistently improved during the process
of minimizing the pixel-wise loss between the predicted heatmap and the GT Gaussian-smoothed
heatmap. 2) From a novel perspective, we formulate the optimization of the heatmap prediction as a
distribution matching problem between the predicted heatmap and the GT dot annotation directly,
which bypasses the step of constructing the Gaussian-smoothed heatmap and achieves consistent
model performance improvement. 3) Our proposed method achieves state-of-the-art performance on
the evaluated benchmarks.
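To make the first contribution concrete, the following toy case shows how a lower pixel-wise loss against a Gaussian-smoothed heatmap need not mean better joint localization. It is a constructed example, not the analysis of Sec. 4: the heatmap size, joint location, sigma, and both predictions are assumed values, and the helper names are hypothetical.

```python
import numpy as np


def gaussian_heatmap(shape, center_xy, sigma):
    """Unnormalized Gaussian-smoothed target with peak value 1 at center_xy."""
    rows, cols = np.indices(shape)
    d2 = (cols - center_xy[0]) ** 2 + (rows - center_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def argmax_xy(heatmap):
    """Location (x, y) of the heatmap maximum, used as the predicted joint."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)


def mse(a, b):
    """Pixel-wise mean squared error."""
    return float(np.mean((a - b) ** 2))


gt = gaussian_heatmap((64, 48), center_xy=(20, 30), sigma=2.0)

# Prediction A: correct shape at half amplitude, plus a stronger spurious peak.
pred_a = 0.5 * gt
pred_a[0, 0] += 0.6

# Prediction B: correct peak location, but twice the target amplitude.
pred_b = 2.0 * gt

print("MSE A:", mse(pred_a, gt), "argmax A:", argmax_xy(pred_a))
print("MSE B:", mse(pred_b, gt), "argmax B:", argmax_xy(pred_b))
```

With these assumed values, prediction A has the lower pixel-wise loss, yet its maximum falls on the spurious peak far from the joint, while prediction B localizes the joint exactly despite its higher loss.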
2 Related Work
2D Human Pose Estimation.
Due to its wide range of applications, the task of 2D human pose
estimation has received lots of attention [30, 2, 32, 13, 21, 29, 20, 33, 26, 14, 4, 39, 35, 15, 17,
37, 27, 7, 8]. DeepPose [30] made the first attempt to apply deep neural networks to the task
of 2D human pose estimation by directly regressing the coordinates of body joints. Such
coordinate regression-based methods [30, 2, 32, 13] often show inferior performance compared
to the heatmap-based methods [29, 20, 33, 26, 14, 4, 39, 35, 15, 17, 37], as the heatmap-based
methods can preserve the spatial structure of the input image throughout the encoding and decoding
process [7]. Hence, the great majority of recent state-of-the-art methods [29, 20, 33, 26, 14, 4, 39,
35, 15, 17, 37] regard 2D human pose estimation as a heatmap estimation problem
instead of a coordinate regression problem. Among the heatmap-based methods, Tompson et al.
[29] proposed to apply a Markov Random Field (MRF) to the task of 2D human pose estimation.
After that, an "hourglass" network with a conv-deconv architecture was proposed by Newell et
al. [20]. Xiao et al. [33] proposed a baseline method that predicts the heatmap by adding several
deconvolutional layers to a backbone network. Later on, to maintain high-resolution representations
throughout the heatmap estimation process, HRNet was proposed by Sun et al. [26]. Yuan et al.
[37] further proposed HRFormer to learn high-resolution representations using a
transformer-based architecture. Besides the above heatmap-based methods that use the Gaussian-smoothed
heatmap as the optimization objective, there are also some methods [27, 7, 8] that combine the ideas
of heatmap and coordinate regression by taking the expectation of the predicted heatmap as the
predicted coordinates.
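The idea of taking the expectation of the predicted heatmap as the predicted coordinates can be sketched as follows. This is a minimal illustration rather than the exact formulation of any cited method: normalizing the raw network output with a softmax is an assumption, and the function name `expected_coordinates` is hypothetical.

```python
import numpy as np


def expected_coordinates(logits):
    """Return (x, y) as the expectation of pixel coordinates under the heatmap.

    Assumption: the raw output is turned into a distribution with a softmax.
    """
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(z)
    p /= p.sum()

    rows, cols = np.indices(logits.shape)  # rows index y, cols index x
    return float((p * cols).sum()), float((p * rows).sum())


# Two equally strong neighboring responses give a sub-pixel estimate between them.
logits = np.zeros((3, 4))
logits[1, 1] = 50.0
logits[1, 2] = 50.0
x, y = expected_coordinates(logits)
print(x, y)
```

Because the output is a weighted average of pixel coordinates, the prediction is not restricted to integer pixel locations and the operation is differentiable with respect to the heatmap.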
In this work, unlike previous works, our method bypasses both the regression step
and the step of constructing the Gaussian-smoothed heatmap as the optimization objective. Instead,
from a novel perspective, we propose to formulate the optimization of the heatmap prediction as
a distribution matching problem by minimizing the distribution difference between the predicted
heatmap and the dot annotation.
Distribution Matching.
The idea of distribution matching has been studied in various tasks [24, 25, 31, 38, 41, 23],
such as image retrieval [24], tracking [25], few-shot learning [38], and long-tail
recognition [23]. In this work, from a novel perspective, we design a new distribution matching
scheme to optimize the heatmap prediction with the help of sub-pixel resolutions for 2D human pose
estimation.