it has superior performance. Among convolution-based methods,
HRNet [14] maintains high-resolution features and repeatedly
fuses multi-resolution features throughout the whole process to
generate reliable high-resolution representations. Su et al. [40]
propose a Channel Shuffle Module and a Spatial, Channel-wise
Attention Residual Bottleneck (SCARB) to drive cross-channel
information flow. Zhao et al. [3] leverage a quality prediction
block (OKS-net) to regress object keypoint similarity, which
builds direct awareness of the predicted pose quality. Among
transformer-based networks, TokenPose [31] embeds each keypoint
as a token to simultaneously learn constraint relationships
across keypoints and visual representations from images. Other
works [12], [55] try to handle quantization error and occlusion
issues. However, the detection-first paradigm always incurs
additional computational cost and forward time, so top-down
methods are often infeasible for real-time systems with strict
latency constraints.
Bottom-up Methods. In contrast to top-down methods,
bottom-up methods [1], [8], [23], [26], [28], [29], [37], [56],
[57] first localize the keypoints of all human instances in the
input image and then group them to the corresponding person.
Bottom-up methods mainly concentrate on an effective grouping
process or on tackling scale variation. For example,
CMU-pose [23] proposes a non-parametric representation,
named Part Affinity Fields (PAFs), which encodes the location
and orientation of limbs, to group keypoints to individuals.
AE [26] simultaneously outputs a keypoint heatmap and a
tag map for each body joint, then assigns keypoints with
similar tags to the same individual. HigherHRNet [29] generates
a high-resolution feature pyramid with multi-resolution supervision
and multi-resolution heatmap aggregation to learn scale-aware
representations. Li et al. [1] exploit an encoding-decoding
network with multi-scale Gaussian heatmaps and guiding offset
fields to represent multi-person pose information, and introduce
an auxiliary peak-regularization task into heatmap supervision
to improve performance. However, it is worth noting that the
grouping process, which serves as a post-processing step,
remains computationally complex and redundant.
Point-based methods. Before the deep learning era, in
contrast to most methods, which use the pictorial structure
model [63] for pose estimation, RoDG [64] leverages a pre-defined
dependency graph representing relationships between adjacent
body joints; the positions of adjacent points are sequentially
estimated along the dependency paths (the skeleton path) from
the root node. In the deep learning era, point-based methods
[19], [22], [33]–[36] represent instances by grid points and
have been applied to many tasks. They have drawn much attention
because they are often simpler and more efficient than
anchor-based representations [9], [11], [15], [16]. CenterNet
[19] leverages the bounding-box center to encode object
information and regresses other object properties, such as size,
in parallel to predict the bounding box. SPM [20] represents a
person via the root joint and further presents a fixed
hierarchical body representation to estimate human pose.
Point-Set Anchors [35] leverages a set of pre-defined points
as a pose anchor to provide more
informative features for regression. In contrast to previous
methods that use a center or a pre-defined pose anchor to model
the human instance, we propose to represent the human instance
via an adaptive point set consisting of the center and seven
human-part-related points, as shown in Fig. 3 (a). This novel
representation is able to capture diverse pose information and
effectively model the connections between a human instance and
its keypoints.
Fig. 3. (a) Visualization of the adaptive point set. White points indicate the
human center, and the others are part-related points visualized in different
colors. We leverage an adaptive point set conditioned on each human instance
to represent the human pose in a fine-grained way. (b) Human parts divided
according to the inherent body structure. (c) Black dotted arrows indicate the
bone connections across parts, and solid lines refer to the bone connections
within parts.
III. METHODOLOGY
First, we elaborate on the proposed body representation
in subsection III-A. Then, subsection III-B gives a detailed
description of the network architecture, including the Part
Perception Module, the Enhanced Center-aware Branch, and the
Two-hop Regression Branch. Finally, we report the training and
inference details in subsection III-C.
A. Body Representation
We present an adaptive point set representation that uses the
center point together with several human-part-related points
to represent the human instance. The proposed representation
introduces adaptive human-part-related points, whose features
encode per-part information and thus sufficiently capture the
structural pose information. Meanwhile, these points serve as
intermediate nodes to effectively model the relationship between
the human instance and its keypoints. In contrast to the fixed
hierarchical representation in SPM [20], the adaptive
part-related points are dynamically predicted from the center
feature rather than being pre-defined locations, thus avoiding
the accumulated error propagated along the fixed hierarchical
path. Furthermore, instead of using only the root feature to
encode the whole pose information, our method also leverages
the features of the adaptive points to more sufficiently encode
the keypoint information of different parts.
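The geometric side of this representation can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual implementation: the part names, the `decode_adaptive_points` helper, and the offset-based parameterization are assumptions made for clarity; in the actual method the offsets are predicted by the network from the center feature.

```python
import numpy as np

# Seven human parts as described in the paper (Fig. 3 (b)); the exact
# ordering here is an illustrative assumption.
PARTS = ["head", "shoulder", "left_arm", "right_arm",
         "hip", "left_leg", "right_leg"]

def decode_adaptive_points(center, part_offsets):
    """Hop 1 of a two-hop scheme: recover the adaptive part-related
    points as offsets from the human center. (Hop 2 would analogously
    regress each keypoint as an offset from its part-related point.)"""
    center = np.asarray(center, dtype=np.float64)              # (2,)
    part_offsets = np.asarray(part_offsets, dtype=np.float64)  # (7, 2)
    part_points = center + part_offsets                        # broadcast add
    return {name: pt for name, pt in zip(PARTS, part_points)}

# Toy usage: a center at (50, 80) with unit offsets for every part.
points = decode_adaptive_points([50.0, 80.0], np.ones((7, 2)))
```

Because the offsets are predicted per instance rather than fixed, each part point adapts to the pose of that particular person, which is the key difference from a fixed hierarchical layout.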
Our body representation is built upon the pixel-wise keypoint
regression framework, which estimates a candidate pose at each
pixel. For a human instance, we manually divide the human body
into seven parts (i.e., head, shoulder, left arm, right arm,
hip, left leg and right leg) according to the inherent structure
of the human body, as shown in Fig. 3 (b). Since each divided
human part is a rigid structure, we represent it via an adaptive
human-part-related point, which is dynamically regressed from
the human center. The process can be formulated as: