computer interaction, intelligent video surveillance and robotics [34,4,44]. In recent years, skeleton-based action recognition has gained increasing attention with the advent of cost-effective depth cameras like Microsoft Kinect [52] and advanced pose estimation techniques [2], which make skeleton data more accurate and accessible. By representing an action as a sequence of joint coordinates of the human body, the highly abstracted skeleton data is compact and robust to illumination, changes in human appearance, and background noise.
Effectively modelling the spatial-temporal correlations and dynamics of joints
is crucial for recognizing actions from skeleton sequences. The dominant solutions in recent years have been graph convolutional networks (GCNs) [46], as they can model the irregular topology of the human skeleton. By designing advanced graph topologies or traversal rules, GCN-based methods have greatly improved recognition performance [30,40]. Meanwhile, the recent success of the Transformer [41] has attracted significant interest and yielded performance gains in various computer vision tasks [9,29,3,32]. For skeleton-based action recognition, one would expect the self-attention mechanism in the transformer to naturally capture effective correlations of joints in both spatial and temporal dimensions for action categorization, without enforcing the articulation constraints of the human body as GCNs do. However, there are only a few transformer-based attempts [38,33,51], and they either devise hybrid models of GCN and transformer [33] or rely on a multi-task learning framework [51]. How to utilize self-attention to learn effective spatial-temporal relations of joints and representative motion features remains a thorny problem. Moreover, most of these transformer-based methods directly compute global one-to-one relations of joints in the spatial and temporal dimensions respectively. Such a strategy undervalues the spatial interactions of discriminative local joints and the short-term temporal dynamics that identify crucial action-related patterns. On the one hand, since not all joints are informative for recognizing actions [27,16], these methods suffer from the influence of irrelevant or noisy joints by accumulating correlations with them via the attention mechanism, which can harm recognition. On the other hand, since the vanilla transformer lacks the inductive bias [29] needed to capture the locality of temporally structured data, it is difficult for these methods to directly model effective temporal relations of joints globally over long input sequences.
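To make the issue concrete, the global one-to-one relation computation discussed above typically follows standard scaled dot-product self-attention [41]. As a minimal sketch of this common formulation (the notation below is introduced here for illustration and is not taken from the cited works), let $X \in \mathbb{R}^{N \times C}$ denote the features of the $N$ joints in one frame; spatial self-attention aggregates over all joint pairs as
\[
\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}}\right) XW_V,
\]
where $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ are learnable projections, and temporal self-attention applies the same operation over the $T$ frames of each joint. Because the softmax mixes contributions from every joint (or frame), irrelevant or noisy ones inevitably leak into the aggregated features, which is precisely the weakness highlighted above.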
To tackle these issues, we propose a novel end-to-end Focal and Global Spatial-Temporal Transformer network, dubbed FG-STFormer, to effectively capture the relations of crucial local joints and the global contextual information in both spatial and temporal dimensions for skeleton-based action recognition. It is composed of two components: FG-SFormer and FG-TFormer. Intuitively, each action can be distinguished by the co-movement of (1) some critical local joints, (2) global body parts, and/or (3) joint-part interactions. For example, as shown in Fig. 1, actions such as taking a selfie and kicking mainly involve important joints of the hands, head and feet, as well as the related body parts of the arms and legs, while actions like sitting down primarily require understanding the cooperation and dynamics of body parts. Based on the above observations,
at the late stage of the network, we adaptively sample the informative spa-