AdaptivePose++: A Powerful Single-Stage Network
for Multi-Person Pose Regression
Yabo Xiao, Xiaojuan Wang, Dongdong Yu, Kai Su, Lei Jin, Mei Song, Shuicheng Yan, Fellow, IEEE, Jian
Zhao, Member, IEEE
Abstract—Multi-person pose estimation generally follows top-
down and bottom-up paradigms. Both of them use an extra
stage (e.g., human detection in top-down paradigm or grouping
process in bottom-up paradigm) to build the relationship between
the human instance and corresponding keypoints, leading
to high computation cost and a redundant two-stage pipeline.
To address the above issue, we propose to represent the human
parts as adaptive points and introduce a fine-grained body
representation method. The novel body representation is able to
sufficiently encode the diverse pose information and effectively
model the relationship between the human instance and corre-
sponding keypoints in a single-forward pass. With the proposed
body representation, we further deliver a compact single-stage
multi-person pose regression network, termed AdaptivePose.
During inference, our proposed network needs only a single-step
decode operation to form the multi-person pose, without complex
post-processing or refinement. We apply AdaptivePose to
both 2D and 3D multi-person pose estimation tasks to verify its
effectiveness. Without any bells and whistles,
we achieve the most competitive performance on MS COCO and
CrowdPose in terms of accuracy and speed. Furthermore, the
outstanding performance on MuCo-3DHP and MuPoTS-3D
demonstrates its effectiveness and generalizability in 3D scenes.
Code is available at https://github.com/buptxyb666/AdaptivePose.
Index Terms—Fine-grained, Adaptive point, Single-stage re-
gression, 2D/3D multi-person pose estimation.
I. INTRODUCTION
HUMAN pose estimation (HPE) [13], [17], [23], [42]
is a classical yet challenging task in the computer vision
community. It aims to locate person keypoints in a
natural image. HPE often serves as a necessary step for
high-level vision tasks such as action recognition [4], [5], [43],
[44] and pose tracking [2], [18]. Existing 2D/3D
multi-person pose estimation methods can be categorized into
top-down [3], [10], [11], [13], [14], [39], [40] and bottom-up [1],
[8], [23], [26], [28], [29], [37], [56], [57] paradigms. The top-
down strategy divides this problem into human detection and
Yabo Xiao, Xiaojuan Wang, Lei Jin, Mei Song are with School of Electronic
Engineering, Beijing University of Posts and Telecommunications, Beijing,
China. Email: {xiaoyabo, wj2718, songm}@bupt.edu.cn
Dongdong Yu is with OPPO Research Institute, Beijing, China. Email:
yudongdong@oppo.com
Kai Su is with ByteDance Inc., Beijing, China. Email:
sukai@bytedance.com
Shuicheng Yan is with Sea AI Lab (SAIL), Singapore. Email:
yansc@sea.com
Jian Zhao is with Institute of North Electronic Equipment, Beijing, China
and Department of Mathematics and Theories, Peng Cheng Laboratory,
Shenzhen, China. Email: zhaojian90@u.nus.edu
Corresponding authors: Xiaojuan Wang and Jian Zhao
Fig. 1. Inference time (s) vs. precision (COCO keypoint AP). Our method
achieves the best speed-accuracy trade-offs compared with previous methods
on MS COCO [24].
single-person pose estimation: each detected human region is
cropped and normalized to locate the single-person keypoints.
It achieves superior performance but suffers from large
computation cost and low efficiency due to the additional
human detector. The bottom-up strategy formulates this task
as keypoint localization followed by a grouping process. It first detects
all person keypoints simultaneously on the full image instead
of on cropped single-person regions and then assigns them to
individuals. Although bottom-up methods are more efficient
than top-down methods, the heuristic grouping process is still
computationally complex and typically involves many
hand-designed rules.
Both top-down and bottom-up methods generally use the
conventional keypoint heatmap representation that models the
human pose via absolute keypoint position, as shown in Fig.
2 (a), which separates the relationship between the position of
human instance and corresponding keypoints. Consequently,
an extra stage is required to build up the connections. Recent
works have tried to model the connections between the
human body and its keypoints in a single-forward
pass, but they face obstacles that
compromise performance. As shown in Fig. 2 (b), CenterNet
[19] represents the instance as a center point and encodes the
relationship between the instance and its keypoints via
center-to-joint offsets. Nevertheless, it achieves inferior performance
because the limited center feature cannot encode various
poses effectively. As shown in Fig. 2 (c), SPM [20] also
arXiv:2210.04014v1 [cs.CV] 8 Oct 2022
Fig. 2. (a) Conventional body representation generally used in top-down
methods such as Rmpe [39] as well as bottom-up methods such as CMU-pose
[23]. (b) Center-to-joint body representation proposed by CenterNet [19]. (c)
Hierarchical body representation introduced by SPM [20]. (d) Our adaptive
point set representation. In contrast to (b) and (c) only using center or root
features, the features of adaptive points are introduced to encode the keypoint
information in each part.
represents the human instance via the limited feature of the root
joint and further employs a fixed hierarchical structure along
the skeleton path to build the relationship between the human
instance and keypoints. Because the intermediate nodes are
pre-defined and the supervision acts on the offsets between
adjacent joints, errors accumulate
along the fixed hierarchical path.
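The accumulation effect can be illustrated with a small worked example. This is a minimal sketch, not SPM's implementation: the skeleton path, the offsets and the per-hop error are all made-up values chosen only to show that, when each joint is decoded from the previous one, a constant per-hop error grows with the number of hops from the root.

```python
# Illustrative only: 2D positions are (x, y) tuples and all numbers are made up.

def add(p, q):
    """Add two 2D vectors."""
    return (p[0] + q[0], p[1] + q[1])

def chain_decode(root, offsets):
    """Follow a fixed path: each joint = previous joint + its offset."""
    joints = []
    current = root
    for off in offsets:
        current = add(current, off)
        joints.append(current)
    return joints

# Hypothetical ground-truth offsets along one path (three hops from the root).
true_offsets = [(10.0, -40.0), (5.0, 30.0), (5.0, 30.0)]
# Assumption: every hop is predicted with the same error of (+1, +1) pixels.
pred_offsets = [(dx + 1.0, dy + 1.0) for dx, dy in true_offsets]

root = (100.0, 200.0)
true_joints = chain_decode(root, true_offsets)
pred_joints = chain_decode(root, pred_offsets)

# Position error of each joint along the path.
errors = [(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred_joints, true_joints)]
print(errors)  # [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
```

The joint three hops away from the root carries three times the per-hop error, which is the kind of drift a long fixed hierarchy is exposed to.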
To address the aforementioned problems, in this work
we propose a novel body representation that can
sufficiently encode various human poses and effectively build
the relations between the instance and keypoints in a
single-forward pass. Specifically, the human body is divided into several
parts and each human part is represented as an adaptive
part-related point. In this manner, we leverage the human
center feature together with the features at several
human-part related points to represent diverse human poses. The
connections are built along the path from the center to the adaptive points and then
to the keypoints, as shown in Fig. 2 (d). Compared with
previous representations, our representation brings two
benefits: 1) The proposed point set representation
introduces additional features at adaptive part-related points,
which encode more informative features for flexible
poses than the limited center representation. 2) The
adaptive part-related points serve as relay nodes that more
effectively model the associations between the human instance and
its keypoints in a single-forward pass.
With the adaptive point set representation, we propose an
effective and efficient single-stage differentiable regression
network, termed AdaptivePose, which mainly consists of three
novel components. First, we introduce the Part Perception
Module to regress seven adaptive human-part related points
for perceiving the corresponding seven human parts. Second, in
contrast to using a limited feature with a fixed receptive field
to predict the human center, we propose the Enhanced
Center-aware Branch, which adapts the receptive field by
aggregating the features of the adaptive human-part related points
to perceive the center of various poses more precisely. Third,
we propose the Two-hop Regression Branch together with the
Skeleton-Aware Regression Loss for regressing keypoints. The
adaptive human-part related points act as one-hop nodes to
factorize the center-to-joint offsets dynamically.
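The coordinate arithmetic behind this two-hop factorization can be sketched as follows. This is a minimal illustration under stated assumptions rather than the AdaptivePose implementation: the seven part names follow the division given in Section III-A, while the assignment of keypoints to parts (COCO-style names), the function name `two_hop_decode` and all offset values are hypothetical. How the network predicts the offsets from features is not covered here.

```python
# Seven part names as listed in Section III-A. The keypoint-to-part assignment
# is an ASSUMPTION for illustration, using 17 COCO-style keypoint names.
PART_KEYPOINTS = {
    "head": ["nose", "left_eye", "right_eye", "left_ear", "right_ear"],
    "shoulder": ["left_shoulder", "right_shoulder"],
    "left_arm": ["left_elbow", "left_wrist"],
    "right_arm": ["right_elbow", "right_wrist"],
    "hip": ["left_hip", "right_hip"],
    "left_leg": ["left_knee", "left_ankle"],
    "right_leg": ["right_knee", "right_ankle"],
}

def two_hop_decode(center, hop1, hop2):
    """Decode keypoints as center + (center-to-part offset) + (part-to-keypoint offset).

    center: (x, y) of the human center.
    hop1:   dict part name -> (dx, dy) from the center to the part point.
    hop2:   dict keypoint name -> (dx, dy) from its part point to the keypoint.
    """
    pose = {}
    for part, names in PART_KEYPOINTS.items():
        # First hop: locate the adaptive part point from the center.
        px = center[0] + hop1[part][0]
        py = center[1] + hop1[part][1]
        for name in names:
            # Second hop: locate each keypoint from its part point.
            pose[name] = (px + hop2[name][0], py + hop2[name][1])
    return pose

# Toy offsets (made-up numbers, not network outputs).
center = (50.0, 60.0)
hop1 = {part: (2.0, -3.0) for part in PART_KEYPOINTS}
hop1["left_arm"] = (-10.0, 5.0)
hop2 = {name: (1.0, 1.0) for names in PART_KEYPOINTS.values() for name in names}

pose = two_hop_decode(center, hop1, hop2)
print(len(pose))           # 17
print(pose["left_wrist"])  # (41.0, 66.0)
```

Every keypoint is at most two hops from the center, so the decode is a single pass of additions with no grouping step.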
A preliminary version of this work [62] was accepted at the
AAAI Conference on Artificial Intelligence (AAAI) 2022.
We extend it in five aspects: 1) We augment the
content of the Abstract, Introduction, Related Work,
Methodology and Experiments to cover sufficient details for a clearer
and more comprehensive presentation. 2) We improve the
regression loss and add an additional loss term to learn the
bone connections of inner parts and cross parts, which is
helpful for crowded scenes. 3) We tune several hyper-parameters
to improve the performance in a single-forward pass, and add
more ablation experiments with analyses to verify the superior
positioning capacity of our framework. We further report
more comprehensive comparisons with competitive bottom-up
counterparts and list more qualitative results. 4) We report
state-of-the-art results on CrowdPose [12], which contains
an enormous number of crowded scenes. 5) We keep the 2D
framework and add depth estimation components, further
extending our method to the 3D multi-person pose estimation task. The
promising results on MuPoTS-3D [51] verify the effectiveness
and generalizability of our method in 3D scenes.
We summarize our main contributions as follows:
• We propose to represent human parts as points, so that the
human body can be represented via an adaptive point set
including the center and several human-part related points.
To the best of our knowledge, we are the first to present a
fine-grained and adaptive body representation that sufficiently
encodes the pose information and effectively builds up the
relation between the human instance and keypoints in a
single-forward pass.
• Based on the novel representation, we exploit a compact
single-stage differentiable network, termed AdaptivePose.
Specifically, we introduce a novel Part Perception
Module to perceive the human parts by regressing seven
human-part related points. By manipulating the human-part
related points, we further propose the Enhanced
Center-aware Branch to more precisely perceive the human
center and the Two-hop Regression Branch together with
the Skeleton-Aware Regression Loss to precisely regress
the keypoints.
• Our method significantly simplifies the pipeline of
existing multi-person pose estimation methods. Its effectiveness
is demonstrated on both 2D and 3D pose estimation
benchmarks. We achieve the best speed-accuracy trade-offs
without complex refinements and post-processes.
Furthermore, extended experiments on CrowdPose and
MuPoTS-3D clearly verify the generalizability to crowded
and 3D scenes.
II. RELATED WORK
In this section, we review three lines of work related to our method:
top-down methods, bottom-up methods and
point-based methods.
Top-down Methods. Given an arbitrary RGB image,
top-down methods [3], [10], [11], [13], [14], [39], [40] first crop
and resize the region of each detected person and then locate
the single-person keypoints in each cropped area. Because the detected
human areas are cropped and resized to a unified size,
this paradigm achieves superior performance. For convolution-based methods,
HRNet [14] maintains high-resolution features and repeatedly
fuses multi-resolution features through the whole process to
generate reliable high-resolution representations. Su et al. [40]
propose a Channel Shuffle Module and a Spatial,
Channel-wise Attention Residual Bottleneck (SCARB) to drive the
cross-channel information flow. Zhao et al. [3] leverage a
quality prediction block (OKS-net) to regress object keypoint
similarity, which builds direct awareness of the predicted
pose quality. Among transformer-based networks, TokenPose [31]
embeds each keypoint as a token to simultaneously learn
constraint relationships across keypoints and visual
representations from images. Other works [12], [55] try to handle
quantization error and occlusion. However, the
detection-first paradigm brings additional computational cost and
forward time, so top-down methods are often not feasible for
real-time systems with strict latency constraints.
Bottom-up Methods. In contrast to top-down methods,
bottom-up methods [1], [8], [23], [26], [28], [29], [37], [56],
[57] first localize keypoints of all human instances in the
input image and then group them to the corresponding person.
Bottom-up methods mainly concentrate on an effective
grouping process or on tackling scale variation. For example,
CMU-pose [23] proposes a nonparametric representation,
named Part Affinity Fields (PAFs), which encodes the location
and orientation of limbs, to group the keypoints into individuals.
AE [26] simultaneously outputs a keypoint heatmap and a
tag map for each body joint, then assigns keypoints with
similar tags to the same individual. HigherHRNet [29] generates a
high-resolution feature pyramid with multi-resolution supervision
and multi-resolution heatmap aggregation for learning
scale-aware representations. Li et al. [1] exploit an
encoding-decoding network with multi-scale Gaussian heatmaps and
guiding offset fields to represent multi-person pose
information, and introduce an auxiliary task of peak regularization into
heatmap supervision to improve performance. However,
the grouping process, which serves as a
post-process, is still computationally complex and redundant.
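To make the nature of such a grouping post-process concrete, the tag-based idea can be sketched as below. This is a simplified greedy variant written for illustration and is not the procedure of AE [26]: the function name `group_by_tag`, the scalar tags, the distance threshold and the toy detections are all assumptions.

```python
def group_by_tag(detections, threshold=1.0):
    """Greedily group keypoint detections into people by tag similarity.

    detections: list of (joint_name, (x, y), tag) with scalar tags.
    A detection joins the person whose mean tag is closest, provided the
    distance is below `threshold` and that person lacks this joint;
    otherwise it starts a new person.
    """
    people = []  # each person: {"tags": [...], "joints": {name: (x, y)}}
    for name, xy, tag in detections:
        best, best_dist = None, threshold
        for person in people:
            if name in person["joints"]:
                continue  # one joint of each type per person
            mean_tag = sum(person["tags"]) / len(person["tags"])
            dist = abs(tag - mean_tag)
            if dist < best_dist:
                best, best_dist = person, dist
        if best is None:
            best = {"tags": [], "joints": {}}
            people.append(best)
        best["tags"].append(tag)
        best["joints"][name] = xy
    return people

# Toy detections for two people (made-up coordinates and tags).
detections = [
    ("nose", (10, 10), 0.1),
    ("nose", (80, 12), 5.0),
    ("left_wrist", (15, 40), 0.2),
    ("left_wrist", (85, 42), 4.9),
]
people = group_by_tag(detections)
print(len(people))          # 2
print(people[0]["joints"])  # {'nose': (10, 10), 'left_wrist': (15, 40)}
```

Even this toy version needs a hand-chosen threshold and a loop over all candidate people for every detection, which hints at why grouping adds cost and hand-designed rules on top of keypoint detection.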
Point-based methods. Before the deep learning era,
whereas most methods used the pictorial structure model
[63] for pose estimation, RoDG [64] leveraged a pre-defined
dependency graph representing relationships between adjacent
body joints; the positions of these adjacent
points are sequentially estimated along the dependency paths
(skeleton paths) from the root node. In the deep learning era,
point-based methods [19], [22], [33]–[36] represent
instances by grid points and have been applied in many tasks.
They have drawn much attention as they are generally simpler
and more efficient than anchor-based representations [9], [11],
[15], [16]. CenterNet [19] leverages the bounding box center to
encode the object information and regresses the other object
properties, such as size, in parallel to predict the bounding box.
SPM [20] represents the person via the root joint and further
presents a fixed hierarchical body representation to estimate
human pose. Point-Set Anchors [35] proposes to leverage a
set of pre-defined points as pose anchors to provide more
informative features for regression. In contrast to previous
methods that use the center or pre-defined pose anchors to model
Fig. 3. (a) The visualization of adaptive point set. White points indicate the
human center and others refer to part related points visualized by different
colors. We leverage an adaptive point set conditioned on each human instance
to represent the human pose in a fine-grained way. (b) Divided human parts
according to inherent body structure. (c) Black dotted arrows indicate the bone
connections of cross parts and solid lines refer to bone connections of inner
parts.
the human instance, we propose to represent the human instance via
an adaptive point set including the center and seven human-part
related points, as shown in Fig. 3 (a). The novel representation
is able to capture diverse pose information and effectively
model the connections between the human instance and keypoints.
III. METHODOLOGY
First, we elaborate on the proposed body representation
in subsection III-A. Then, subsection III-B gives a detailed
description of the network architecture, including the Part Perception
Module, the Enhanced Center-aware Branch and the
Two-hop Regression Branch. Finally, we report the training and
inference details in subsection III-C.
A. Body Representation
We present an adaptive point set representation that uses the
center point together with several human-part related points
to represent the human instance. The proposed representation
introduces adaptive human-part related points, whose
features are used to encode the per-part information and can thus
sufficiently capture the structural pose information. Meanwhile,
they serve as intermediate nodes to effectively model
the relationship between the human instance and keypoints.
In contrast to the fixed hierarchical representation in SPM
[20], the adaptive part-related points are dynamically predicted from the center
feature rather than placed at pre-defined locations, thus avoiding
the accumulated error propagated along the fixed hierarchical
path. Furthermore, instead of only using the root feature to
encode the whole pose information, our method also leverages the features of the adaptive
points to encode the keypoint information of
different parts more sufficiently.
Our body representation is built upon the pixel-wise
keypoint regression framework, which estimates a candidate
pose at each pixel. For a human instance, we manually divide
the human body into seven parts (i.e., head, shoulder, left arm,
right arm, hip, left leg and right leg) according to the inherent
structure of the human body, as shown in Fig. 3 (b). Each divided
human part is a rigid structure, and we represent it via an adaptive
human-part related point, which is dynamically regressed from
the human center. The process can be formulated as: