AMPOSE: ALTERNATELY MIXED GLOBAL-LOCAL ATTENTION MODEL FOR 3D HUMAN POSE ESTIMATION

Hongxin Lin, Yunwei Chiu and Peiyuan Wu

National Taiwan University, Taiwan

ABSTRACT

Graph convolutional networks (GCNs) have been applied to model the physically connected and non-local relations among human joints for 3D human pose estimation (HPE). In addition, purely Transformer-based models have recently shown promising results in video-based 3D HPE. However, single-frame methods still need to model the physically connected relations among joints, because feature representations transformed only by global relations via the Transformer neglect information on the human skeleton. To deal with this problem, we propose a novel method, namely AMPose, in which Transformer encoders and GCN blocks are alternately stacked to combine the global and physically connected relations among joints for HPE. In AMPose, the Transformer encoder is applied to connect each joint with all the other joints, while GCNs are applied to capture information on physically connected relations. The effectiveness of the proposed method is evaluated on the Human3.6M dataset. Our model also shows better generalization ability when tested on the MPI-INF-3DHP dataset. Code can be retrieved at https://github.com/erikervalid/AMPose.

Index Terms: GCNs, Transformer, 3D human pose estimation, 2D-3D lifting

1. INTRODUCTION

Human pose estimation (HPE) is an attractive topic for researchers in computer vision. In particular, 3D HPE is closely related to real-world applications in human-robot interaction, sports, and augmented reality. Most previous works build their models with the 2D-3D lifting method [1, 2, 3], which first infers the 2D pose in images with off-the-shelf 2D pose models [4] and then takes the 2D pose as the input of a lifting model that predicts the 3D pose. Separating 3D HPE into two phases can reduce the influence of image backgrounds [2].

Video-based 3D HPE has been developed for several years [2, 5, 6]. Temporal modeling regards temporal information as independent tokens, so running these models demands high-performance computing [5, 6, 7]. Considering the computational cost, single-frame models can be easier to deploy in real-world applications.

Early work has shown that features transformed from 2D poses can be useful information for estimating 3D positions [2]. To address the lack of modeling of the spatial relationships among joints, graph convolutional networks (GCNs) have recently been adopted in many HPE models [3, 8, 9, 10]. A drawback of GCNs derived from spectral convolution is weight sharing: each node is transformed by the same transformation matrix, and the neighboring features are then aggregated to transfer information to the next layer [11]. Weight sharing may not capture the information on human joints well, because the flexibility and speed of human motion vary across joints [3, 9].
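The weight-sharing behaviour described above can be illustrated with a minimal NumPy sketch. It assumes the commonly used symmetrically normalized form D^{-1/2}(A + I)D^{-1/2} for the aggregation step; the 5-joint chain skeleton, the feature sizes, and the function names `normalized_adjacency` and `gcn_layer` are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    adj = np.eye(num_joints)  # self-loops keep each joint's own feature
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]

def gcn_layer(x, adj_norm, weight):
    """x: (J, C_in) joint features; weight: (C_in, C_out), shared by all joints."""
    transformed = x @ weight             # the same matrix transforms every joint
    aggregated = adj_norm @ transformed  # mix features of directly connected joints
    return np.maximum(aggregated, 0.0)   # ReLU non-linearity

# Hypothetical 5-joint chain, used only to exercise the layer.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))              # 2D joint coordinates as input features
weight = rng.normal(size=(2, 8))
adj_norm = normalized_adjacency(edges, 5)
out = gcn_layer(x, adj_norm, weight)     # shape (5, 8)
```

Because a single layer aggregates only over direct neighbours, perturbing joint 4 in this toy chain leaves the output of joint 0 unchanged. The single `weight` matrix is what makes the layer weight-sharing: no joint gets its own transformation.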

Global dependency among joints in single-frame 3D HPE remains unclear. Transformer-based models have recently shown high performance in various computer vision tasks because the self-attention mechanism in the Transformer can capture global information [12]. In the case of 3D HPE, the self-attention mechanism can relate each joint to all the other joints to obtain global dependency [6]. It is suitable for modeling the similarity relations among joints, since the global dependency can change with the input pose [1, 13]. Nevertheless, purely Transformer-based models may lack the physical information in the human skeleton [1]. To address this issue, previous works [1, 14] proposed integrating multi-hop graph convolution into the Transformer. However, both graph convolution with a multi-hop range [15] and self-attention have large receptive fields, which may not explicitly capture the local relations in the human skeleton. We therefore adopt GCNs [11] to model the physically connected joints and improve the effectiveness of the model for 3D HPE. Additionally, following the success of previous structures [6, 16, 17], we propose to alternately mix the local and global information with two independent modules, the GCN block and the Transformer encoder, respectively.
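To make the notion of global dependency concrete, the following is a minimal single-head scaled dot-product self-attention sketch over joint tokens. It is an illustration under stated assumptions, not the authors' implementation: the 17-joint count, the feature width, the random projection matrices, and the function name `joint_self_attention` are all hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def joint_self_attention(x, w_q, w_k, w_v):
    """x: (J, C) joint tokens. Returns the (J, C) output and the (J, J) attention map."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # row i: how joint i weights every joint
    return attn @ v, attn

rng = np.random.default_rng(0)
num_joints, dim = 17, 16                 # assumed sizes, for illustration only
x = rng.normal(size=(num_joints, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
out, attn = joint_self_attention(x, w_q, w_k, w_v)
```

Every entry of `attn` is strictly positive, so each joint receives information from all other joints in a single step, and the map is recomputed from the input, so it changes with the pose. No skeleton information enters the computation, which is the gap the GCN block is meant to fill.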

The contributions of this paper can be summarized as follows: 1) On the basis of the global-local attention structure, we propose to alternately stack Transformer encoders and GCN blocks, namely AMPose, to capture the global and local information for 3D HPE. 2) Different designs are explored for the GCN blocks, which can fully exploit the physical-connectivity features. 3) Our model outperforms the state-of-the-art models on the Human3.6M dataset. The proposed model also shows better generalization ability on MPI-INF-3DHP compared with previous works.
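The alternating stacking in contribution 1) can be sketched roughly as follows. This is a simplified illustration only: the block internals (single-head attention, residual connections, no normalization or feed-forward layers), the order of the two blocks within a pair, the toy chain skeleton, and all names and sizes are assumptions, since this section does not specify the actual AMPose block design.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def chain_adjacency(num_joints):
    """Normalized adjacency of a toy chain graph (a stand-in for a real skeleton)."""
    adj = np.eye(num_joints)
    idx = np.arange(num_joints - 1)
    adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
    d = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d[:, None] * d[None, :]

def global_block(x, p):
    """Self-attention with a residual connection: every joint sees every joint."""
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x + attn @ v

def local_block(x, adj_norm, p):
    """Graph convolution with a residual connection: only connected joints mix."""
    return x + np.maximum(adj_norm @ (x @ p["w_g"]), 0.0)

def alternating_lifter(pose_2d, adj_norm, params):
    """Lift a (J, 2) pose to (J, 3) by alternating global and local blocks."""
    x = pose_2d @ params["embed"]              # (J, 2) -> (J, C)
    for layer in params["layers"]:
        x = global_block(x, layer)             # global relations
        x = local_block(x, adj_norm, layer)    # physically connected relations
    return x @ params["head"]                  # (J, C) -> (J, 3)

def init_params(rng, dim, depth):
    """Random, untrained weights; enough to check shapes and information flow."""
    def mat(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)
    layers = [{name: mat(dim, dim) for name in ("w_q", "w_k", "w_v", "w_g")}
              for _ in range(depth)]
    return {"embed": mat(2, dim), "layers": layers, "head": mat(dim, 3)}

rng = np.random.default_rng(0)
num_joints = 17                                # assumed joint count
params = init_params(rng, dim=32, depth=3)
adj_norm = chain_adjacency(num_joints)
pose_2d = rng.uniform(-1.0, 1.0, size=(num_joints, 2))
pose_3d = alternating_lifter(pose_2d, adj_norm, params)  # shape (17, 3)
```

With untrained weights the output carries no meaning; the sketch only shows how the two module types interleave, so that each pair of blocks first mixes information across all joints and then refines it along the skeleton edges.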

arXiv:2210.04216v5 [cs.CV] 31 Oct 2023
