paradigms have been designed for skeleton-based action recognition. Early ap-
proaches [20,49,50] design hand-crafted features to capture joint motion pat-
terns and feed them directly to downstream classifiers. With the rise of deep learning, subsequent methods treat skeleton data as time series and process them with Recurrent Neural Networks [28,37,43,59,61] or Temporal Convolution Networks [22,23]. However, these approaches do not model joint relationships explicitly, which leads to inferior recognition performance. To mitigate this, GCN-based approaches [5,29,39,53] construct spatial-temporal graphs, on which spatial modeling and temporal modeling are performed separately.
An early application of GCNs to skeleton-based action recognition is ST-GCN [53]. ST-GCN processes skeleton data with stacked GCN blocks, each consisting of a spatial module and a temporal module. As a simple baseline, ST-GCN instantiates both modules in a straightforward way: the spatial module adopts sparse coefficient matrices derived from a pre-defined adjacency matrix for spatial feature fusion, while the temporal module uses a single 1D convolution for temporal modeling.
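To make the block structure concrete, here is a minimal PyTorch sketch of such a block (class and argument names are ours for illustration, not from [53]):

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Minimal sketch of an ST-GCN-style block; assumes a fixed (V, V)
    coefficient matrix A derived from the skeleton adjacency matrix."""

    def __init__(self, in_channels, out_channels, A, t_kernel=9):
        super().__init__()
        self.register_buffer('A', A)  # fixed spatial coefficients, not learned
        # 1x1 convolution: per-joint feature transform before spatial fusion
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # a single 1D convolution along the temporal axis (naive temporal module)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(t_kernel, 1),
                                  padding=((t_kernel - 1) // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        x = self.spatial(x)
        x = torch.einsum('nctv,vw->nctw', x, self.A)  # spatial feature fusion
        return self.relu(self.temporal(x))
```

For instance, `STGCNBlock(3, 64, torch.eye(25))(torch.randn(8, 3, 100, 25))` yields a tensor of shape (8, 64, 100, 25).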
Follow-up works inherit the main framework of ST-GCN and make incremental improvements to the design of the spatial and temporal modules. For temporal modeling, a multi-scale TCN [5,29] replaces the naive implementation and can model actions of various durations. Despite the improved capacity, the temporal module still performs joint-level motion modeling, which we find insufficient.
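To illustrate the idea, the sketch below shows one common multi-scale layout, parallel dilated temporal convolutions whose outputs are concatenated (a simplification; the branch designs in [5,29] also include pooling and 1x1 branches):

```python
import torch
import torch.nn as nn

class MultiScaleTCN(nn.Module):
    """Sketch of a multi-scale temporal module: parallel temporal
    convolutions with different dilations cover different durations."""

    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        branch_c = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, branch_c, kernel_size=1),  # reduce channels
                nn.Conv2d(branch_c, branch_c, kernel_size=(3, 1),
                          padding=(d, 0), dilation=(d, 1)),    # temporal conv
            )
            for d in dilations
        ])

    def forward(self, x):
        # x: (N, C, T, V); each branch has a different temporal receptive field
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```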
For spatial modeling, a series of works propose to learn a data-driven refinement of the prescribed
graphical structure. The refinement can be either channel-agnostic [39,58] (ob-
tained with a self-attention mechanism) or channel-specific [5,6]. Meanwhile, Liu et al. [29] introduce multi-scale graph topologies to model joint relationships at different ranges.
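The sketch below illustrates the channel-agnostic variant of such refinement (a hypothetical module, not the exact design of [39,58]): spatial fusion uses A + dA, where A comes from the prescribed graph and dA is learned:

```python
import torch
import torch.nn as nn

class RefinedGCN(nn.Module):
    """Sketch of channel-agnostic topology refinement: the prescribed
    coefficient matrix A is refined by a learned offset dA."""

    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer('A', A)                 # (V, V), from the skeleton
        self.dA = nn.Parameter(torch.zeros_like(A))  # learned, shared by channels
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); fuse joints with the refined coefficient matrix
        return torch.einsum('nctv,vw->nctw', self.proj(x), self.A + self.dA)
```

A channel-specific variant in the spirit of [5,6] would instead learn one dA per channel or channel group.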
There also exist spatial modeling modules that do not require a pre-defined topology. Shift-GCN [7] adopts graph shift for inter-joint
feature fusion, while SGN [58] estimates the coefficient matrices based on the in-
put skeleton sequence with a lightweight attention module. However, these purely dynamic approaches are less competitive and do not achieve state-of-the-art performance. In this work, we design the dynamic group GCN (DG-GCN), which learns the spatial
feature fusion strategy from scratch and does not rely on a prescribed graphical
structure. Our DG-GCN outperforms various alternatives in the ablation study.
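As a schematic illustration of learning fusion coefficients from scratch (a simplified sketch of the general idea only, not the full DG-GCN design), the coefficient matrices below are free parameters, one per channel group, with no prescribed topology involved:

```python
import torch
import torch.nn as nn

class LearnedCoeffGCN(nn.Module):
    """Sketch of spatial fusion learned from scratch: one free (V, V)
    coefficient matrix per channel group, no pre-defined graph."""

    def __init__(self, in_channels, out_channels, num_joints, groups=8):
        super().__init__()
        assert out_channels % groups == 0
        self.groups = groups
        # randomly initialized coefficient matrices, learned end-to-end
        self.coeff = nn.Parameter(
            torch.randn(groups, num_joints, num_joints) * 0.01)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V) -> split channels into groups, fuse joints per group
        x = self.proj(x)
        n, c, t, v = x.shape
        x = x.view(n, self.groups, c // self.groups, t, v)
        x = torch.einsum('ngctv,gvw->ngctw', x, self.coeff)
        return x.reshape(n, c, t, v)
```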
Besides GCNs, another stream of work adopts convolutional neural networks (CNNs) for skeleton-based action recognition. 2D-CNN-based approaches convert a skeleton sequence into a pseudo image and process it with a 2D CNN. The pseudo image can be obtained by: 1) directly arranging joint coordinates into a 2D input of shape V×T [2,22,25] (V: number of joints; T: temporal length; see the sketch after this list); or 2) generating pseudo heatmaps for the joints (2D only) and aggregating them along the temporal dimension with color coding [8] or learnable modules [52] to form a K-channel pseudo image.
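A minimal sketch of strategy 1) follows (function name and normalization are our illustration, not from [2,22,25]):

```python
import numpy as np

def skeleton_to_pseudo_image(joints):
    """Map a skeleton sequence of shape (T, V, C) -- T frames, V joints,
    C-dim coordinates -- to a pseudo image of shape (C, V, T)."""
    img = np.transpose(joints, (2, 1, 0))  # coordinates become channels
    # normalize each coordinate channel to [0, 1] so it behaves like pixels
    lo = img.min(axis=(1, 2), keepdims=True)
    hi = img.max(axis=(1, 2), keepdims=True)
    return (img - lo) / (hi - lo + 1e-6)
```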
These approaches either fail to exploit the locality of CNNs or suffer from information loss during heatmap aggregation, and thus perform worse than representative GCN-based approaches. Recently, PoseC3D [12]
proposes to stack heatmaps along the temporal dimension and use 3D-CNN for