it has superior performance. Among convolution-based methods,
HRNet [14] maintains high-resolution features and repeatedly
fuses multi-resolution features throughout the whole process to
generate reliable high-resolution representations. Su et al. [40]
propose a Channel Shuffle Module and a Spatial, Channel-wise
Attention Residual Bottleneck (SCARB) to drive cross-channel
information flow. Zhao et al. [3] leverage a quality prediction
block (OKS-net) to regress object keypoint similarity, which
builds direct awareness of the predicted pose quality. Among
transformer-based networks, TokenPose [31] embeds each keypoint
as a token to simultaneously learn constraint relationships
across keypoints and visual representations from images. Other
works [12], [55] try to handle quantization error and occlusion
issues. However, the detection-first paradigm always incurs
additional computational cost and forward time, so top-down
methods are often infeasible for real-time systems with strict
latency constraints.
Bottom-up Methods. In contrast to top-down methods,
bottom-up methods [1], [8], [23], [26], [28], [29], [37], [56],
[57] first localize the keypoints of all human instances in the
input image and then group them to the corresponding person.
Bottom-up methods mainly concentrate on an effective grouping
process or on tackling scale variation. For example,
CMU-pose [23] proposes a non-parametric representation,
named Part Affinity Fields (PAFs), which encodes the location
and orientation of limbs, to group keypoints to individuals.
AE [26] simultaneously outputs a keypoint heatmap and a
tag map for each body joint, then assigns keypoints with
similar tags to the same individual. HigherHRNet [29] generates
a high-resolution feature pyramid with multi-resolution supervision
and multi-resolution heatmap aggregation to learn scale-aware
representations. Li et al. [1] exploit an encoding-decoding
network with multi-scale Gaussian heatmaps and guiding offset
fields to represent multi-person pose information, and introduce
an auxiliary peak-regularization task into heatmap supervision
to improve performance. However, it is worth noting that the
grouping process, which serves as a post-processing step,
remains computationally complex and redundant.
Point-based methods. Before the deep learning era, in
contrast to most methods, which use the pictorial structure
model [63] for pose estimation, RoDG [64] leverages a pre-defined
dependency graph representing relationships between adjacent
body joints; the positions of adjacent points are sequentially
estimated along the dependency paths (the skeleton path) from
the root node. In the deep learning era, point-based methods
[19], [22], [33]–[36] represent instances by grid points and
have been applied to many tasks. They have drawn much attention
because they are often simpler and more efficient than
anchor-based representations [9], [11], [15], [16]. CenterNet
[19] leverages the bounding-box center to encode object
information and regresses other object properties, such as size,
in parallel to predict the bounding box. SPM [20] represents a
person via the root joint and further presents a fixed
hierarchical body representation to estimate human pose.
Point-Set Anchors [35] leverages a set of pre-defined points
as a pose anchor to provide more
informative features for regression. In contrast to previous
methods that use a center or a pre-defined pose anchor to model
the human instance, we propose to represent the human instance
via an adaptive point set consisting of the center and seven
human-part-related points, as shown in Fig. 3 (a). This novel
representation is able to capture diverse pose information and
effectively model the connections between a human instance and
its keypoints.
Fig. 3. (a) Visualization of the adaptive point set. White points indicate the
human center, and the others are part-related points visualized in different
colors. We leverage an adaptive point set conditioned on each human instance
to represent the human pose in a fine-grained way. (b) Human parts divided
according to the inherent body structure. (c) Black dotted arrows indicate the
bone connections across parts, and solid lines refer to the bone connections
within parts.
III. METHODOLOGY
First, we elaborate on the proposed body representation
in subsection III-A. Then, subsection III-B gives a detailed
description of the network architecture, including the Part
Perception Module, the Enhanced Center-aware Branch, and the
Two-hop Regression Branch. Finally, we report the training and
inference details in subsection III-C.
A. Body Representation
We present an adaptive point set representation that uses the
center point together with several human-part-related points
to represent the human instance. The proposed representation
introduces adaptive human-part-related points, whose features
encode per-part information and thus sufficiently capture the
structural pose information. Meanwhile, these points serve as
intermediate nodes to effectively model the relationship between
the human instance and its keypoints. In contrast to the fixed
hierarchical representation in SPM [20], the adaptive
part-related points are dynamically predicted from the center
feature rather than being pre-defined locations, thus avoiding
the accumulated error propagated along the fixed hierarchical
path. Furthermore, instead of using only the root feature to
encode the whole pose information, our method also leverages
the features of the adaptive points to more sufficiently encode
the keypoint information of different parts.
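The geometric side of this representation can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual implementation: the part names, the `decode_adaptive_points` helper, and the offset-based parameterization are assumptions made for clarity; in the actual method the offsets are predicted by the network from the center feature.

```python
import numpy as np

# Seven human parts as described in the paper (Fig. 3 (b)); the exact
# ordering here is an illustrative assumption.
PARTS = ["head", "shoulder", "left_arm", "right_arm",
         "hip", "left_leg", "right_leg"]

def decode_adaptive_points(center, part_offsets):
    """Hop 1 of a two-hop scheme: recover the adaptive part-related
    points as offsets from the human center. (Hop 2 would analogously
    regress each keypoint as an offset from its part-related point.)"""
    center = np.asarray(center, dtype=np.float64)              # (2,)
    part_offsets = np.asarray(part_offsets, dtype=np.float64)  # (7, 2)
    part_points = center + part_offsets                        # broadcast add
    return {name: pt for name, pt in zip(PARTS, part_points)}

# Toy usage: a center at (50, 80) with unit offsets for every part.
points = decode_adaptive_points([50.0, 80.0], np.ones((7, 2)))
```

Because the offsets are predicted per instance rather than fixed, each part point adapts to the pose of that particular person, which is the key difference from a fixed hierarchical layout.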
Our body representation is built upon the pixel-wise keypoint
regression framework, which estimates a candidate pose at each
pixel. For a human instance, we manually divide the human body
into seven parts (i.e., head, shoulder, left arm, right arm,
hip, left leg and right leg) according to the inherent structure
of the human body, as shown in Fig. 3 (b). Since each divided
human part is a rigid structure, we represent it via an adaptive
human-part-related point, which is dynamically regressed from
the human center. The process can be formulated as: