DG-STGCN: Dynamic Spatial-Temporal
Modeling for Skeleton-based Action Recognition
Haodong Duan1,2, Jiaqi Wang2, Kai Chen2, and Dahua Lin1,2
1 The Chinese University of Hong Kong
2 Shanghai AI Lab
Abstract. Graph convolutional networks (GCNs) have been widely used in skeleton-based action recognition. We note that existing GCN-based approaches primarily rely on prescribed graphical structures (i.e., a manually defined topology of skeleton joints), which limits their flexibility to capture complicated correlations between joints. To move beyond this limitation, we propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN). It consists of two modules, DG-GCN and DG-TCN, for spatial and temporal modeling, respectively. In particular, DG-GCN uses learned affinity matrices to capture dynamic graphical structures instead of relying on a prescribed one, while DG-TCN performs group-wise temporal convolutions with varying receptive fields and incorporates a dynamic joint-skeleton fusion module for adaptive multi-level temporal modeling. On a wide range of benchmarks, including NTURGB+D, Kinetics-Skeleton, BABEL, and Toyota SmartHome, DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
Keywords: Skeleton-based Action Recognition, Dynamic, GCN.
1 Introduction
Human action recognition is a central task in video understanding. For videos, various modalities derived from the rich multimedia content are beneficial to the recognition task, including RGB [4,13,15,16,46,47], optical flow [32,36,42], and human skeletons [8,12,29,39,53]. Among them, skeleton-based action recognition has attracted increasing attention due to its action-focusing nature and its robustness against complicated backgrounds or varying lighting conditions. In contrast to other modalities, skeleton data are highly compact and abstract, containing only 2D [4,21] or 3D [9,27,33,37] human joint coordinates. For action recognition, the compactness of skeleton data leads to robust features and efficient processing.
Since Yan et al. [53] first proposed ST-GCN to model skeleton motion patterns with a spatial-temporal graph, GCN-based approaches have quickly become the most popular paradigm for skeleton-based action recognition. In ST-GCN, spatial modeling and temporal modeling are performed separately by spatial graph convolutions and temporal convolutions. Spatial graph convolutions fuse features of different joints according to a manually defined topology, presented as a sparse adjacency matrix indicating the direct connections between skeleton joints. For temporal modeling, temporal 1D convolutions are applied to each joint in parallel to model the joint-specific motion patterns.

Fig. 1: GFLOPs vs. accuracy on two NTURGB+D 120 benchmarks (XSub and XSet), comparing ST-GCN, AGCN, MS-G3D, CTR-GCN, and DG-STGCN. We report single-model accuracy trained on the joint modality; details in Table 4.
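To make this baseline concrete, below is a minimal PyTorch sketch (not the authors' code) of the two operations an ST-GCN block chains together: spatial feature fusion governed by a fixed normalized adjacency matrix, followed by a temporal 1D convolution applied to every joint in parallel. The tensor layout (batch, channels, time, joints) and the single-matrix formulation are simplifying assumptions; ST-GCN itself partitions the adjacency matrix into up to three subsets.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Minimal ST-GCN-style block: fixed-topology spatial fusion + temporal conv.

    Input/output layout: (batch, channels, time, joints).
    """
    def __init__(self, in_ch, out_ch, A, t_kernel=9):
        super().__init__()
        # A: (V, V) normalized adjacency matrix from the manually defined topology.
        self.register_buffer('A', A)
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # per-joint transform
        # Temporal 1D convolution, applied to each joint in parallel.
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                                # x: (N, C, T, V)
        x = self.spatial(x)                              # transform features per joint
        x = torch.einsum('nctv,vw->nctw', x, self.A)     # fuse joints along fixed graph
        return self.relu(self.temporal(x))               # joint-wise temporal modeling
```

Stacking roughly ten such blocks with increasing channel width, followed by global pooling and a classifier, recovers the overall shape of the ST-GCN architecture.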
Follow-up works [5,39,57] propose to improve spatial modeling with learnable topology refinement. Though demonstrating good effectiveness, these approaches still rely heavily on the manually defined skeleton topology, requiring painstaking customization for different datasets. More importantly, due to the intrinsic cooperation among joints when performing actions, fully learnable coefficient matrices, rather than a prescribed graphical structure, are better suited to model complicated joint correlations. Meanwhile, previous GCN-based approaches mostly utilize temporal convolutions with a fixed receptive field to model joint-level motions within a certain temporal range, overlooking the benefits of modeling multi-level (i.e., joint-level + skeleton-level) motion patterns within a dynamic temporal receptive field.
Revisiting the limitations of current works, we propose a novel GCN architecture for skeleton-based action recognition, namely DG-STGCN, which enables group-wise dynamic spatial-temporal modeling of skeleton data. In DG-STGCN, spatial modeling and temporal modeling are performed separately by dynamic group GCNs (DG-GCNs) and dynamic group temporal ConvNets (DG-TCNs), respectively. Our proposed framework has three appealing properties. First, DG-STGCN relies purely on learnable coefficient matrices for spatial modeling, eliminating the cumbersome procedure of manually defining a good joint topology. Second, the dynamic group-wise design enables dynamic spatial-temporal modeling of skeleton motion with diversified groups of graph convolutions and temporal convolutions, improving representation capability and flexibility. Third, DG-STGCN achieves significant improvements on multiple benchmarks while preserving model efficiency.
In DG-STGCN, both DG modules first transform the input skeleton features into N feature groups (the channel width of each group is reduced to 1/N), and then perform spatial or temporal modeling independently for each feature group. In DG-GCN, each feature group has its own dynamic coefficient matrix for inter-joint spatial modeling (N = 8 in our experiments; with a manually defined topology, N can only be 1 or 3). Each coefficient matrix is a dynamic summation of three data-driven components: one shared matrix that models the joint correlations across all samples (the static component) and two sample-specific matrices (the dynamic components) with different designs. In experiments, DG-GCN demonstrates strong capability, surpassing multiple variants of graph convolutions in both efficacy and efficiency.
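The PyTorch sketch below is purely illustrative of this group-wise design: each of the N feature groups owns a shared learnable matrix (the static component), to which a sample-specific matrix predicted from the input is added before group-wise spatial fusion. The temporal-pooling branch used here to predict the dynamic component is a hypothetical stand-in for the paper's two dynamic designs, which differ in detail.

```python
import torch
import torch.nn as nn

class DGGCNSketch(nn.Module):
    """Group-wise GCN with learned coefficient matrices (illustrative sketch).

    Layout: (batch, channels, time, joints); channels are split into `groups`.
    """
    def __init__(self, channels, num_joints, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        # Static component: one shared (V, V) matrix per group, learned from
        # scratch with no prescribed skeleton topology.
        self.PA = nn.Parameter(torch.zeros(groups, num_joints, num_joints))
        # Predicts a sample-specific (V, V) matrix per group from pooled features.
        # This single dynamic branch is an assumed stand-in for the paper's two.
        self.dyn = nn.Conv2d(channels, groups * num_joints, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (N, C, T, V)
        n, c, t, v = x.shape
        pooled = x.mean(dim=2, keepdim=True)           # (N, C, 1, V) temporal pooling
        dyn = self.dyn(pooled).view(n, self.groups, v, v)
        A = self.PA.unsqueeze(0) + dyn                 # static + dynamic components
        xg = x.view(n, self.groups, c // self.groups, t, v)
        out = torch.einsum('ngctv,ngvw->ngctw', xg, A)  # group-wise spatial fusion
        return self.proj(out.reshape(n, c, t, v))
```

Stacking such blocks removes any dependence on a dataset-specific skeleton layout: the static matrices play the role that the hand-crafted adjacency matrix plays in ST-GCN, but are free to learn dense, non-local joint correlations.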
For temporal modeling, we propose a dynamic group temporal ConvNet (DG-TCN) with diversified receptive fields, which further adopts a dynamic joint-skeleton fusion (D-JSF) module to fuse joint-level and skeleton-level motion patterns over various temporal ranges. Specifically, it models joint-level and skeleton-level features in parallel with a multi-group temporal ConvNet, where each group extracts motion patterns within a different temporal receptive field. The joint-level and skeleton-level features are dynamically fused into joint-skeleton motion features with learnable joint-specific coefficients. While preserving computational efficiency, DG-TCN has strong temporal modeling capability, and the proposed D-JSF module enables multi-level temporal modeling at negligible additional cost.
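A minimal sketch of the temporal counterpart follows, under the assumption that the varying receptive fields are realized with different dilation rates (a common choice in multi-scale TCNs; the paper's exact branch configuration may differ). Skeleton-level features are obtained here by average-pooling over joints, and the joint-specific fusion coefficients are plain learnable parameters rather than the paper's dynamic formulation.

```python
import torch
import torch.nn as nn

class DGTCNSketch(nn.Module):
    """Group-wise temporal convs with varying receptive fields, plus a
    D-JSF-style joint/skeleton fusion (illustrative sketch)."""
    def __init__(self, channels, num_joints, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        gc = channels // len(dilations)
        # One temporal branch per group; the dilation varies the receptive field.
        self.branches = nn.ModuleList([
            nn.Conv2d(gc, gc, kernel_size=(3, 1), padding=(d, 0), dilation=(d, 1))
            for d in dilations
        ])
        # Learnable joint-specific fusion coefficients, squashed to [0, 1].
        self.alpha = nn.Parameter(torch.zeros(1, 1, 1, num_joints))

    def forward(self, x):                                 # x: (N, C, T, V)
        chunks = torch.chunk(x, len(self.branches), dim=1)
        joint = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        skeleton = joint.mean(dim=3, keepdim=True)        # pool joints -> skeleton level
        w = torch.sigmoid(self.alpha)                     # per-joint fusion weight
        return w * joint + (1 - w) * skeleton             # joint-skeleton fusion
```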
With improved flexibility comes an increased risk of overfitting. We therefore adopt Uniform Sampling as a strong temporal data augmentation strategy, which randomly samples a subsequence from the skeleton sequence to generate highly diversified training samples. The strategy leads to consistent improvements across multiple backbones and benchmarks and is especially beneficial for DG-STGCN.
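The sketch below shows Uniform Sampling in the formulation commonly used for skeleton sequences (split the sequence into equal segments and pick one frame per segment, randomly during training); the paper's exact implementation details may differ.

```python
import numpy as np

def uniform_sample(seq, num_frames, train=True):
    """Uniform Sampling: split the sequence into `num_frames` equal segments
    and pick one frame from each (random in training, center at test time).

    This follows the common formulation of the strategy; details in the paper
    may differ. `seq`: array of shape (T, V, C)."""
    t = seq.shape[0]
    bounds = np.linspace(0, t, num_frames + 1).astype(int)
    if train:
        idx = [np.random.randint(lo, max(lo + 1, hi))
               for lo, hi in zip(bounds[:-1], bounds[1:])]
    else:
        idx = [(lo + hi) // 2 for lo, hi in zip(bounds[:-1], bounds[1:])]
    return seq[np.array(idx)]
```

Compared with cropping a fixed-length window, every epoch sees a different frame subset of each sequence, which is what makes the strategy an effective augmentation for the more flexible DG-STGCN.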
Extensive experimental results highlight the effectiveness of the dynamic group design and the good practices we propose. Our DG-STGCN outperforms all previous state-of-the-art methods across multiple skeleton-based action recognition benchmarks, including NTURGB+D [27,37], Kinetics-Skeleton [4,53], BABEL [33], and Toyota SmartHome [9].
2 Related Works
2.1 Graph Neural Networks
To process non-Euclidean structured data (like arbitrarily structured graphs), Graph Neural Networks (GNNs) [1,10,17,18,24,48,51] are widely adopted and extensively explored. GNNs can be broadly categorized into spectral GNNs and spatial GNNs. Spectral GNNs [1,18,19] apply convolutions in the spectral domain. They assume a fixed adjacency structure across all samples, which limits their generalizability to unseen graph structures. Spatial GNNs [14,17,48], in contrast, perform layer-wise feature updates for each node via local feature fusion and activation. Most GCN approaches for skeleton-based action recognition follow the spirit of spatial GNNs: they construct a spatial-temporal graph based on the skeleton data, apply convolutions for feature aggregation within a neighborhood, and perform layer-wise feature updates.
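Concretely, a typical spatial GNN layer update (in the common Kipf-Welling form) can be written as H^(l+1) = σ(Â H^(l) W^(l)), where Â is a normalized adjacency matrix, H^(l) the node features at layer l, and W^(l) a learnable weight matrix; the skeleton GCNs discussed below instantiate this update on the joint graph.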
2.2 Skeleton-based Action Recognition
Human skeletons are robust against background or illumination changes and can be obtained with sensors [60] or pose estimators [3,34,45]. Various paradigms have been designed for skeleton-based action recognition. Early approaches [20,49,50] design hand-crafted features to capture joint motion patterns and feed them directly to downstream classifiers. With the prosperity of deep learning, later methods treat skeleton data as time series and process them with Recurrent Neural Networks [28,37,43,59,61] and Temporal Convolution Networks [22,23]. However, these approaches do not model the joint relationships explicitly, leading to inferior recognition performance. To mitigate this, GCN-based approaches [5,29,39,53] construct spatial-temporal graphs on which spatial modeling and temporal modeling are performed separately.
An early application of GCNs to skeleton-based action recognition is ST-GCN [53]. ST-GCN processes skeleton data with stacked GCN blocks, each consisting of a spatial module and a temporal module. Being a simple baseline, ST-GCN's instantiations of the two modules are straightforward: the spatial module adopts sparse coefficient matrices derived from a pre-defined adjacency matrix for spatial feature fusion, while the temporal module uses a single 1D convolution for temporal modeling. Follow-up works inherit the main framework of ST-GCN and develop various incremental improvements to the design of the spatial and temporal modules. For temporal modeling, advanced multi-scale TCNs [5,29] replace the naive implementation and are capable of modeling actions of multiple durations. Despite the improved capacity, the temporal module still performs joint-level motion modeling, which we find insufficient. For spatial modeling, a series of works propose to learn a data-driven refinement of the prescribed graphical structure. The refinement can be either channel-agnostic [39,58] (obtained with a self-attention mechanism) or channel-specific [5,6]. Meanwhile, Liu et al. [29] introduce multi-scale graph topologies to model joint relationships over different ranges. There also exist spatial modeling modules that do not require a pre-defined topology: Shift-GCN [7] adopts graph shift for inter-joint feature fusion, while SGN [58] estimates the coefficient matrices from the input skeleton sequence with a lightweight attention module. However, these purely dynamic approaches are less competitive and cannot achieve state-of-the-art performance. In this work, we design the dynamic group GCN, which learns the spatial feature fusion strategy from scratch and does not rely on a prescribed graphical structure. Our DG-GCN outperforms various alternatives in the ablation study.
Besides GCNs, another stream of work adopts convolutional neural networks (CNNs) for skeleton-based action recognition. Approaches based on 2D-CNNs convert a skeleton sequence to a pseudo image and process it with a 2D CNN. The pseudo image can be obtained by: 1) directly generating a 2D input of shape V×T [2,22,25] (V: number of joints; T: temporal length) from the joint coordinates; or 2) generating pseudo heatmaps for joints (2D only) and aggregating them along the temporal dimension with color coding [8] or learnable modules [52] to form a K-channel pseudo image. These approaches either fail to exploit the locality of CNNs or suffer from information loss during heatmap aggregation, thus performing worse than representative GCN-based approaches. Recently, PoseC3D [12] proposes to stack heatmaps along the temporal dimension and use a 3D-CNN for skeleton-based action recognition.
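For concreteness, option 1) amounts to a simple re-arrangement of the coordinate tensor; the (T, V, C) input layout assumed below is illustrative.

```python
import numpy as np

def to_pseudo_image(skeleton):
    """Arrange a skeleton sequence of shape (T, V, C) into a C-channel V x T
    pseudo image, the input format used by 2D-CNN approaches of type 1)."""
    # (T, V, C) -> (C, V, T): joints as rows, time as columns, coords as channels.
    return np.transpose(skeleton, (2, 1, 0))
```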