computer interaction, intelligent video surveillance and robotics [34,4,44]. In recent years, skeleton-based action recognition has gained increasing attention with the advent of cost-effective depth cameras like Microsoft Kinect [52] and advanced pose estimation techniques [2], which make skeleton data more accurate and accessible. By representing an action as a sequence of joint coordinates of the human body, the highly abstracted skeleton data is compact and robust to illumination, changes in human appearance, and background noise.
Effectively modelling the spatial-temporal correlations and dynamics of joints
is crucial for recognizing actions from skeleton sequences. The dominant solutions in recent years have been graph convolutional networks (GCNs) [46], as they can model the irregular topology of the human skeleton. By designing advanced graph topologies or traversal rules, GCN-based methods have greatly improved recognition performance [30,40]. Meanwhile, the recent success of the Transformer [41] has attracted significant interest and yielded performance gains in various computer vision tasks [9,29,3,32]. For skeleton-based action recognition, one would expect the self-attention mechanism in the transformer to naturally capture effective correlations of joints in both spatial and temporal dimensions for action categorization, without enforcing the articulation constraints of the human body as GCNs do. However, there are only a few transformer-based attempts [38,33,51], and they either devise hybrid models of GCN and transformer [33] or rely on a multi-task learning framework [51]. How to utilize self-attention to learn effective spatial-temporal relations of joints and representative motion features remains a thorny problem. Moreover, most of these transformer-based methods directly compute global one-to-one relations of joints in the spatial and temporal dimensions respectively. Such a strategy undervalues the spatial interactions of discriminative local joints and the short-term temporal dynamics that identify crucial action-related patterns. On the one hand, since not all joints are informative for recognizing actions [27,16], these methods suffer from the influence of irrelevant or noisy joints by accumulating correlations with them via the attention mechanism, which can harm recognition. On the other hand, since the vanilla transformer lacks the inductive bias [29] needed to capture the locality of temporally structured data, it is difficult for these methods to directly model effective temporal relations of joints globally over long input sequences.
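To make the issue concrete, the global one-to-one relation computation discussed above typically follows standard scaled dot-product self-attention [41]. As a minimal sketch of this common formulation (the notation below is introduced here for illustration and is not taken from the cited works), let $X \in \mathbb{R}^{N \times C}$ denote the features of the $N$ joints in one frame; spatial self-attention aggregates over all joint pairs as
\[
\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}}\right) XW_V,
\]
where $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ are learnable projections, and temporal self-attention applies the same operation over the $T$ frames of each joint. Because the softmax mixes contributions from every joint (or frame), irrelevant or noisy ones inevitably leak into the aggregated features, which is precisely the weakness highlighted above.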
To tackle these issues, we propose a novel end-to-end Focal and Global Spatial-Temporal Transformer network, dubbed FG-STFormer, to effectively capture the relations of crucial local joints and the global contextual information in both spatial and temporal dimensions for skeleton-based action recognition. It is composed of two components: FG-SFormer and FG-TFormer. Intuitively, each action can be distinguished by the co-movement of (1) some critical local joints, (2) global body parts, and/or (3) joint-part interactions. For example, as shown in Fig. 1, actions such as taking a selfie and kicking mainly involve important joints of the hands, head and feet, as well as the related body parts of the arms and legs, while actions like sitting down primarily require understanding the cooperation and dynamics of body parts. Based on the above observations,
at the late stage of the network, we adaptively sample the informative spa-