Focal and Global Spatial-Temporal Transformer
for Skeleton-based Action Recognition
Zhimin Gao1, Peitao Wang1, Pei Lv1, Xiaoheng Jiang1, Qidong Liu1, Pichao
Wang2, Mingliang Xu1*, and Wanqing Li3
1Zhengzhou University, Zhengzhou, China
zhimingao113@gmail.com, wptao98@163.com,
{ielvpei,jiangxiaoheng,ieqdliu,iexumingliang}@zzu.edu.cn
2DAMO Academy, Alibaba Group (U.S.) Inc
pichaowang@gmail.com
3AMRL, University of Wollongong, Wollongong, Australia
wanqing@uow.edu.au
* Corresponding author
Abstract. Despite the great progress achieved by transformers in various vision
tasks, they are still underexplored for skeleton-based action recognition, with
only a few attempts. Besides, these methods directly calculate the pair-wise
global self-attention equally for all the joints in both the spatial and temporal
dimensions, undervaluing the effect of discriminative local joints and the
short-range temporal dynamics. In this work, we propose a novel Focal and
Global Spatial-Temporal Transformer network (FG-STFormer), which is equipped
with two key components: (1) FG-SFormer: a focal-joints and global-parts coupling
spatial transformer. It forces the network to focus on modelling correlations
for both the learned discriminative spatial joints and the human body parts.
The selective focal joints eliminate the negative effect of non-informative ones
when accumulating the correlations. Meanwhile, the interactions between the
focal joints and body parts are incorporated to enhance the spatial dependencies
via mutual cross-attention. (2) FG-TFormer: a focal and global temporal
transformer. Dilated temporal convolution is integrated into the global
self-attention mechanism to explicitly capture the local temporal motion patterns
of joints or body parts, which is found to be vital to making the temporal
transformer work. Extensive experimental results on three benchmarks, namely
NTU-60, NTU-120 and NW-UCLA, show that our FG-STFormer surpasses all
existing transformer-based methods and compares favourably with state-of-the-art
GCN-based methods.
Keywords: Action recognition · Skeleton · Spatial-temporal transformer ·
Focal joints · Motion patterns.
1 Introduction
Human action recognition has long been a crucial and active research field in
video understanding since it has a broad range of applications, such as human-computer
interaction, intelligent video surveillance and robotics [34,4,44]. In recent
years, skeleton-based action recognition has gained increasing attention with the
advent of cost-effective depth cameras like Microsoft Kinect [52] and advanced
pose estimation techniques [2], which make skeleton data more accurate and ac-
cessible. By representing an action as a sequence of joint coordinates of the human
body, the highly abstracted skeleton data is compact and robust to illumination,
human appearance changes and background noise.
Effectively modelling the spatial-temporal correlations and dynamics of joints
is crucial for recognizing actions from skeleton sequences. The dominant solu-
tions to it in recent years are the graph convolutional networks (GCNs) [46],
as they can model the irregular topology of the human skeleton. By designing
advanced graph topologies or traversal rules, GCN-based methods have greatly
improved recognition performance [30,40]. Meanwhile, the recent success of the
Transformer [41] has attracted significant interest and brought performance boosts
in various computer vision tasks [9,29,3,32]. For skeleton-based action recognition,
one would expect the self-attention mechanism in the transformer to naturally
capture effective correlations of joints in both the spatial and temporal dimensions
for action categorization, without enforcing the articulation constraints of the
human body as GCNs do. However, there are only a few transformer-based
attempts [38,33,51], and they devise a hybrid model of GCN and transformer [33]
or a multi-task learning framework [51]. How to utilize self-attention to learn
effective spatial-temporal relations of joints and representative motion features
is still a thorny problem. Moreover, most of these transformer-based methods
directly calculate the global one-to-one relations of joints for the spatial and
temporal dimensions respectively. Such a strategy undervalues the spatial interactions
of discriminative local joints and the short-term temporal dynamics for identifying
crucial action-related patterns. On the one hand, since not all joints are informative
for recognizing actions [27,16], these methods suffer from the influence of
irrelevant or noisy joints by accumulating correlations with them via the attention
mechanism, which can harm recognition. On the other hand, given that the
vanilla transformer lacks the inductive bias [29] to capture the locality of
temporally structured data, it is difficult for these methods to directly model
effective temporal relations of joints globally over a long input sequence.
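To make the limitation concrete, the sketch below (in PyTorch, with hypothetical class and tensor names; it is not the code of any of the cited methods) shows the kind of plain global pair-wise self-attention these approaches apply over joints: every joint attends to every other one on equal footing, and nothing in the layer itself favours informative joints or local temporal neighbourhoods.

```python
import torch
import torch.nn as nn


class GlobalJointSelfAttention(nn.Module):
    """Minimal sketch of plain global self-attention over skeleton joints.
    Hypothetical layer for illustration only: every joint attends to every
    other joint, with no built-in preference for informative joints and no
    locality bias."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, dim) joint embeddings of one frame
        out, _ = self.attn(x, x, x)  # all-to-all attention across joints
        return out
```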
To tackle these issues, we propose a novel end-to-end Focal and Global
Spatial-Temporal Transformer network, dubbed FG-STFormer, to effec-
tively capture relations of the crucial local joints and the global contextual
information in both spatial and temporal dimensions for skeleton-based action
recognition. It is composed of two components: FG-SFormer and FG-TFormer.
Intuitively, each action can be distinguished by the co-movement of: (1) some
critical local joints, (2) global body parts, and/or (3) joint-part interactions. For
example, as shown in Fig. 1, actions such as taking a selfie and kicking mainly
involve important joints of hands, head and feet, as well as related body parts
of arms and legs, while actions like sit down primarily require understanding
of body-part cooperation and dynamics. Based on the above observations,
at the late stage of the network, we adaptively sample the informative spa-
tial local joints (focal joints) for each action, and force the network to focus
on modelling the correlations among them via multi-head self-attention with-
out involving non-informative joints. Meanwhile, in order to compensate for the
missing global co-movement and spatial structure information, we incorporate
the dependencies among human body parts using self-attention. Furthermore,
interactions between the body parts and the focal joints are explicitly modelled
via mutual cross-attention to enhance their spatial collaboration. All of these
are achieved by the proposed FG-SFormer.
Fig. 1. The proposed FG-SFormer (bottom) learns correlations for both adaptively
selected focal joints and body parts, as well as the joint-part interactions via cross-
attention. FG-TFormer (top) models the explicit local temporal relations of joints or
parts, as well as the global temporal dynamics.
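As an illustration of the FG-SFormer idea described above, the sketch below gives one plausible realisation in PyTorch. The top-k scoring of focal joints, the mean-pooling of joints into parts, and all layer names are simplifying assumptions for this example, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FGSFormerSketch(nn.Module):
    """Illustrative sketch of a focal-joint / global-part coupling spatial
    transformer. Shapes, the scoring layer and the part pooling are
    assumptions made for this example."""

    def __init__(self, dim: int, heads: int = 4, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # predicts how informative each joint is
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.part_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.joint_to_part = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.part_to_joint = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, joints, part_index):
        # joints: (batch, num_joints, dim)
        # part_index: list of joint-index tensors, one per body part
        # 1) adaptively select the k most informative (focal) joints
        scores = self.score(joints).squeeze(-1)                    # (B, V)
        idx = scores.topk(self.k, dim=-1).indices                  # (B, k)
        focal = torch.gather(
            joints, 1, idx.unsqueeze(-1).expand(-1, -1, joints.size(-1)))
        # 2) self-attention restricted to the focal joints only
        focal, _ = self.joint_attn(focal, focal, focal)
        # 3) pool joints into body parts and model part-level correlations
        parts = torch.stack(
            [joints[:, ids].mean(dim=1) for ids in part_index], dim=1)
        parts, _ = self.part_attn(parts, parts, parts)
        # 4) mutual cross-attention couples the focal-joint and part streams
        focal = focal + self.part_to_joint(focal, parts, parts)[0]
        parts = parts + self.joint_to_part(parts, focal, focal)[0]
        return focal, parts
```

The essential point is that non-informative joints are excluded before the joint-level attention accumulates correlations, while the part stream retains the global structure and co-movement information they would otherwise carry.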
The FG-TFormer is designed to model the temporal dynamics of joints or
body parts. It is found that straightforwardly using the vanilla temporal transformer
leads to ineffective temporal relations and poor recognition performance.
We find that one of the key culprits lies in the absence of a locality bias, which
makes it challenging for the transformer to focus on effective temporal motion
patterns in the long input. Taking these factors into consideration, we integrate
dilated temporal convolutions into the multi-head self-attention mechanism to
explicitly encode the short-term temporal motion of a joint or part from its
temporal neighbours, which equips the transformer with a local inductive bias.
The short-range feature representations of all the frames are further fused by
the global self-attention weights to embrace the global contextual motion
information in the representations. Thus, the designed strategy enables the
transformer to learn both important local and effective global temporal relations
of the joints and human body parts in a unified structure, which is validated
to be critical to making the temporal transformer work.
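A minimal version of this local-plus-global temporal design is sketched below, assuming a single dilation value, a single joint (or part) feature sequence as input, and an additive fusion; the actual FG-TFormer may use several dilated branches and a different fusion, so treat the names and shapes as assumptions.

```python
import torch
import torch.nn as nn


class FGTFormerSketch(nn.Module):
    """Illustrative sketch: a dilated temporal convolution supplies the local
    inductive bias, and global self-attention fuses the resulting short-term
    motion features across all frames. One dilation only, for brevity."""

    def __init__(self, dim: int, heads: int = 4, kernel: int = 3, dilation: int = 2):
        super().__init__()
        pad = dilation * (kernel - 1) // 2
        self.local = nn.Conv1d(dim, dim, kernel, padding=pad, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) feature sequence of one joint or body part
        local = self.local(x.transpose(1, 2)).transpose(1, 2)  # short-term motion encoding
        out, _ = self.attn(local, local, local)                # global fusion over all frames
        return out + local                                     # keep both local and global cues
```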
To summarize, the contributions of this work lie in four aspects:
1. We propose a novel FG-STFormer network for skeleton-based action recog-
nition, which can effectively capture the discriminative correlations of focal
joints as well as the global contextual motion information in both the spa-
tial and temporal dimensions.
2. We design a focal joints and global parts coupling spatial transformer, namely
FG-SFormer, to model the correlations of adaptively selected focal joints
and those of human body parts. The joint-part mutual cross-attention is in-
tegrated to enhance the spatial collaboration.
3. We introduce an FG-TFormer to explicitly capture both the short- and long-
range temporal dependencies of the joints and body parts effectively.
4. The extensive experimental results on three datasets highlight the effective-
ness of our method, which surpasses all existing transformer-based methods.
2 Related Work
Skeleton-based Action Recognition. With the great progress achieved in skeleton-based
action recognition, existing works can be broadly divided into three groups,
i.e., RNN-, CNN-, and GCN-based methods. RNN-based methods concatenate the
coordinates of all joints in one frame and treat the sequence as a time series [10,53,19,49,24].
Some works design specialized network structures, like trees [26,42], to make RNNs
aware of spatial information. CNN-based methods transform a skeleton sequence
into a pseudo-image in hand-crafted manners [45,18,22,28,13,11,21], and then
use popular networks to learn the spatial and temporal dynamics in it.
The emergence of GCN-based methods, like ST-GCN [46], enables a more
natural representation of the spatial topology of skeleton joints by organizing
them as a non-Euclidean graph. The spatial correlation is modelled for bone-connected
joints. As the fixed graph topology (or adjacency matrix) is not flexible enough
to model the dependencies among spatially disconnected joints, many subsequent
methods focus on designing high-order or multi-scale adjacency matrices [23,12,30,15,20]
and dynamically adjusted graph topologies [37,23,48,50,5]. Nevertheless, these
manually devised joint traversal rules limit the flexibility to learn more effective
spatial-temporal dynamics of joints for action recognition.
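For context, the core operation of such GCN-based layers can be written in a few lines; the sketch below assumes a single fixed, normalised adjacency matrix, whereas the methods cited above typically use several graph partitions and learnable or dynamically adjusted graphs.

```python
import torch
import torch.nn as nn


class GraphConvSketch(nn.Module):
    """Minimal sketch of a spatial graph convolution over skeleton joints,
    as used (in richer form) by ST-GCN-style models. A single normalised
    adjacency matrix A is assumed for illustration."""

    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        self.register_buffer("A", adjacency)  # (V, V) normalised skeleton adjacency
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints V, in_dim) joint features of one frame
        # aggregate features from bone-connected joints, then linearly project
        return self.proj(torch.einsum("vw,bwc->bvc", self.A, x))
```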
Transformer based Methods. Several recent works extend Transformer [41]
to spatial and temporal dimensions of skeleton-based action recognition. Among
them, DSTA [38] is the first to use self-attention to learn joint relations, whereas
in practice a spatial transformer interleaved with temporal convolutions is employed
for some typical datasets. ST-TR [33] adopts a hybrid architecture of
GCN and transformer in a two-stream network, with each stream replacing the
GCN or temporal convolution with spatial or temporal self-attention. STST [51]
introduces a transformer network in which the spatial and temporal dimensions
are modelled separately in parallel. Besides, the network is trained jointly with
multi-task self-supervised learning tasks.
3 Proposed Method
In this section, we first briefly review the basic spatial and temporal Transformer
blocks (referred to as Basic-SFormer and Basic-TFormer blocks respectively)