AMPOSE: ALTERNATELY MIXED GLOBAL-LOCAL ATTENTION MODEL FOR 3D
HUMAN POSE ESTIMATION
Hongxin Lin, Yunwei Chiu and Peiyuan Wu
National Taiwan University, Taiwan
ABSTRACT
Graph convolutional networks (GCNs) have been applied
to model the physically connected and non-local relations
among human joints for 3D human pose estimation (HPE).
In addition, purely Transformer-based models have recently
shown promising results in video-based 3D HPE. However,
single-frame methods still need to model the physically
connected relations among joints, because feature
representations transformed only by global relations via the
Transformer neglect information on the human skeleton. To deal
with this problem, we propose a novel method, namely AMPose,
in which Transformer encoders and GCN blocks are alternately
stacked to combine the global and physically connected
relations among joints for HPE. In AMPose,
the Transformer encoder is applied to connect each joint
with all the other joints, while GCNs are applied to capture
information on physically connected relations. The effectiveness
of our proposed method is evaluated on the Human3.6M
dataset. Our model also shows better generalization ability
when tested on the MPI-INF-3DHP dataset. Code can be retrieved
at https://github.com/erikervalid/AMPose.
Index Terms: GCNs, Transformer, 3D human pose estimation,
2D-3D lifting
1. INTRODUCTION
Human pose estimation (HPE) has attracted considerable interest
from researchers in computer vision. In particular, 3D HPE is
closely related to real-world applications in human-robot
interaction, sports, and augmented reality. Most previous works
build their models with the 2D-3D lifting method [1, 2, 3], which
first infers the 2D pose in an image with an off-the-shelf 2D pose
model [4] and then feeds the 2D pose to a lifting model that
predicts the 3D pose. Separating 3D HPE into two
phases can reduce the influence of image backgrounds [2].
Video-based 3D HPE has been developed for several years
[2, 5, 6]. Temporal modeling treats temporal information as
independent tokens, which demands high-performance computing
to run these models [5, 6, 7]. Considering the computational
cost, single-frame models can be easier to deploy in real-world
applications.
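To make the cost argument concrete, the following minimal sketch counts the query-key scores that one self-attention head computes, which grows quadratically with the number of tokens. The 17-joint skeleton and the 243-frame window are illustrative assumptions, not values taken from this paper, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores in one self-attention head (n * n)."""
    return num_tokens * num_tokens


NUM_JOINTS = 17   # assumption: a commonly used skeleton size
NUM_FRAMES = 243  # assumption: hypothetical temporal window length

single_frame = attention_pairs(NUM_JOINTS)  # joints of one frame as tokens
temporal = attention_pairs(NUM_FRAMES)      # frames of one clip as tokens

print(single_frame, temporal)
```

Under these assumed sizes the temporal attention map has roughly two hundred times more entries than the single-frame one, before counting layers, heads, or feature dimensions.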
Early work has shown that features transformed from 2D
poses can be useful for estimating 3D positions [2].
To better capture the spatial relationships among
joints, graph convolutional networks (GCNs) have recently
been adopted in many HPE models [3, 8, 9, 10]. A drawback
of GCNs derived from spectral convolution is their weight
sharing: each node is transformed
by the same transformation matrix, and the neighboring
features are then aggregated to transfer information to the
next layer [11]. Weight sharing may not work
well for capturing information on human joints because the
flexibility and speed of human motion vary across joints [3, 9].
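The weight-sharing behavior described above can be sketched as follows. This is a minimal illustration that assumes the common first-order formulation ReLU(D^-1/2 (A + I) D^-1/2 X W); it is not necessarily the exact layer used in the paper. The 5-joint chain, the toy features, and the weight values are hypothetical.

```python
import math


def normalized_adjacency(edges, num_nodes):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
         for i in range(num_nodes)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(features, adj_norm, weight):
    """One shared-weight graph convolution: ReLU(A_norm @ X @ W)."""
    transformed = matmul(features, weight)      # the same W for every joint
    aggregated = matmul(adj_norm, transformed)  # mix only physical neighbors
    return [[max(0.0, v) for v in row] for row in aggregated]


# Hypothetical 5-joint chain, not the skeleton used in the paper.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = normalized_adjacency(EDGES, 5)
x = [[float(i), 1.0] for i in range(5)]    # toy 2D input per joint
w = [[0.5, -1.0, 0.25], [1.0, 0.0, 0.5]]   # toy shared 2 -> 3 weight
out = gcn_layer(x, adj, w)
```

Because `w` is applied identically to every row of `x`, the layer cannot treat, for example, a wrist differently from a hip, which is the limitation noted in [3, 9].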
The global dependency among joints in single-frame 3D HPE
remains unclear. Transformer-based models in computer
vision have recently shown high performance in various tasks
because the self-attention mechanism in the Transformer
can capture global information [12]. In the
case of 3D HPE, the self-attention mechanism can relate each
joint to all the other joints to obtain global dependency [6].
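As a rough illustration of how self-attention relates each joint to all the others, the sketch below implements single-head scaled dot-product attention over joint tokens. It is a simplification under stated assumptions: the query, key, and value projections are taken to be the identity (a real Transformer learns them), and the four 2D joints are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(tokens[0])
    weights = []
    for q in tokens:
        # Score this joint against every joint, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all joint tokens.
    output = [[sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
              for row in weights]
    return output, weights


# Hypothetical 2D joint coordinates used as tokens.
joints = [[0.0, 0.0], [0.1, 0.5], [0.1, 1.0], [-0.1, 0.5]]
out, attn = self_attention(joints)
```

Every row of `attn` is strictly positive and sums to one, so each joint receives information from all joints regardless of skeletal connectivity, and the weights change whenever the input pose changes.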
The self-attention mechanism is suitable for modeling the
similarity relations among joints since the global dependence can
change with the input pose [1, 13]. Nevertheless,
purely Transformer-based models may lack the physical
information in the human skeleton [1]. To address this
issue, previous works [1, 14] proposed integrating multi-hop
graph convolution into the Transformer. However, both
graph convolution with a multi-hop range [15] and
self-attention have large receptive fields, which may not
explicitly capture the local relations in the human skeleton. Thus
we adopt GCNs [11] to model the physically connected
joints and improve the effectiveness of the model for 3D HPE.
Additionally, following the success of previous structures
[6, 16, 17], we propose to alternately mix local and global
information with two independent modules: the GCN
block and the Transformer encoder, respectively.
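The alternating pattern can be sketched structurally as below. This is not the authors' implementation: the two blocks are deliberately crude stand-ins (uniform mixing over all joints for the Transformer encoder, neighbor averaging for the GCN block), the global-then-local order within each pair is an assumption, and the 4-joint chain and all names are hypothetical.

```python
from typing import Callable, Dict, List

Features = List[List[float]]
Block = Callable[[Features], Features]


def alternate_stack(x: Features, global_block: Block,
                    local_block: Block, depth: int) -> Features:
    """Apply a (global, local) pair of blocks `depth` times in sequence."""
    for _ in range(depth):
        x = global_block(x)  # every joint sees all joints
        x = local_block(x)   # every joint sees only its neighbors
    return x


def make_local_block(neighbors: Dict[int, List[int]]) -> Block:
    """Stand-in for a GCN block: average a joint with its physical neighbors."""
    def block(x: Features) -> Features:
        out = []
        for i, row in enumerate(x):
            group = [row] + [x[j] for j in neighbors[i]]
            out.append([sum(col) / len(group) for col in zip(*group)])
        return out
    return block


def global_block(x: Features) -> Features:
    """Stand-in for a Transformer encoder: residual mix with the all-joint mean."""
    mean = [sum(col) / len(x) for col in zip(*x)]
    return [[0.5 * v + 0.5 * m for v, m in zip(row, mean)] for row in x]


NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # hypothetical 4-joint chain
x0 = [[1.0], [0.0], [0.0], [0.0]]
y = alternate_stack(x0, global_block, make_local_block(NEIGHBORS), depth=2)
```

Keeping the two blocks as independent callables mirrors the idea of mixing global and local information with separate modules, so either stand-in could be swapped for a learned layer without changing `alternate_stack`.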
Contributions of this paper can be summarized as follows:
1) On the basis of the global-local attention structure,
we propose to alternately stack Transformer encoders
and GCN blocks, namely AMPose, to capture the global
and local information for 3D HPE. 2) Different designs are
explored for the GCN blocks to fully exploit the
physical-connectivity features. 3) Our model outperforms
the state-of-the-art models on the Human3.6M dataset. The
proposed model also shows better generalization ability on
MPI-INF-3DHP than previous works.
arXiv:2210.04216v5 [cs.CV] 31 Oct 2023