paradigms have been designed for skeleton-based action recognition. Early ap-
proaches [20,49,50] design hand-crafted features to capture joint motion pat-
terns and feed them directly to downstream classifiers. With the rise of deep learning, subsequent methods treat skeleton data as time series and process them with Recurrent Neural Networks [28,37,43,59,61] or Temporal Convolution Networks [22,23]. However, these approaches do not model joint relationships explicitly, which leads to inferior recognition performance. To mitigate this, GCN-based approaches [5,29,39,53] construct spatial-temporal graphs, on which spatial modeling and temporal modeling are performed separately.
An early application of GCNs to skeleton-based action recognition is ST-GCN [53]. ST-GCN processes skeleton data with stacked GCN blocks, each consisting of a spatial module and a temporal module. As a simple baseline, ST-GCN instantiates both modules in a straightforward way: the spatial module adopts sparse coefficient matrices derived from a pre-defined adjacency matrix for spatial feature fusion, while the temporal module uses a single 1D convolution for temporal modeling.
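To make the block structure concrete, here is a minimal PyTorch sketch of such a block (class and argument names are ours for illustration, not from [53]):

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Minimal sketch of an ST-GCN-style block; assumes a fixed (V, V)
    coefficient matrix A derived from the skeleton adjacency matrix."""

    def __init__(self, in_channels, out_channels, A, t_kernel=9):
        super().__init__()
        self.register_buffer('A', A)  # fixed spatial coefficients, not learned
        # 1x1 convolution: per-joint feature transform before spatial fusion
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # a single 1D convolution along the temporal axis (naive temporal module)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(t_kernel, 1),
                                  padding=((t_kernel - 1) // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        x = self.spatial(x)
        x = torch.einsum('nctv,vw->nctw', x, self.A)  # spatial feature fusion
        return self.relu(self.temporal(x))
```

For instance, `STGCNBlock(3, 64, torch.eye(25))(torch.randn(8, 3, 100, 25))` yields a tensor of shape (8, 64, 100, 25).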
Follow-up works inherit the main framework of ST-GCN and make incremental improvements to the design of the spatial and temporal modules. For temporal modeling, a multi-scale TCN [5,29] replaces the naive implementation and can model actions of various durations. Despite the improved capacity, the temporal module still performs joint-level motion modeling, which we find insufficient.
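To illustrate the idea, the sketch below shows one common multi-scale layout, parallel dilated temporal convolutions whose outputs are concatenated (a simplification; the branch designs in [5,29] also include pooling and 1x1 branches):

```python
import torch
import torch.nn as nn

class MultiScaleTCN(nn.Module):
    """Sketch of a multi-scale temporal module: parallel temporal
    convolutions with different dilations cover different durations."""

    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        branch_c = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, branch_c, kernel_size=1),  # reduce channels
                nn.Conv2d(branch_c, branch_c, kernel_size=(3, 1),
                          padding=(d, 0), dilation=(d, 1)),    # temporal conv
            )
            for d in dilations
        ])

    def forward(self, x):
        # x: (N, C, T, V); each branch has a different temporal receptive field
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```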
For spatial modeling, a series of works propose to learn a data-driven refinement of the prescribed
graphical structure. The refinement can be either channel-agnostic [39,58] (ob-
tained with a self-attention mechanism) or channel-specific [5,6]. Meanwhile, Liu et al. [29] introduce multi-scale graph topologies to model joint relationships at different ranges.
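The sketch below illustrates the channel-agnostic variant of such refinement (a hypothetical module, not the exact design of [39,58]): spatial fusion uses A + dA, where A comes from the prescribed graph and dA is learned:

```python
import torch
import torch.nn as nn

class RefinedGCN(nn.Module):
    """Sketch of channel-agnostic topology refinement: the prescribed
    coefficient matrix A is refined by a learned offset dA."""

    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer('A', A)                 # (V, V), from the skeleton
        self.dA = nn.Parameter(torch.zeros_like(A))  # learned, shared by channels
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); fuse joints with the refined coefficient matrix
        return torch.einsum('nctv,vw->nctw', self.proj(x), self.A + self.dA)
```

A channel-specific variant in the spirit of [5,6] would instead learn one dA per channel or channel group.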
There also exist spatial modeling modules that do not require a pre-defined topology. Shift-GCN [7] adopts graph shift for inter-joint
feature fusion, while SGN [58] estimates the coefficient matrices based on the in-
put skeleton sequence with a lightweight attention module. However, these purely dynamic approaches are less competitive and do not achieve state-of-the-art performance. In this work, we design the dynamic group GCN (DG-GCN), which learns the spatial
feature fusion strategy from scratch and does not rely on a prescribed graphical
structure. Our DG-GCN outperforms various alternatives in the ablation study.
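As a schematic illustration of learning fusion coefficients from scratch (a simplified sketch of the general idea only, not the full DG-GCN design), the coefficient matrices below are free parameters, one per channel group, with no prescribed topology involved:

```python
import torch
import torch.nn as nn

class LearnedCoeffGCN(nn.Module):
    """Sketch of spatial fusion learned from scratch: one free (V, V)
    coefficient matrix per channel group, no pre-defined graph."""

    def __init__(self, in_channels, out_channels, num_joints, groups=8):
        super().__init__()
        assert out_channels % groups == 0
        self.groups = groups
        # randomly initialized coefficient matrices, learned end-to-end
        self.coeff = nn.Parameter(
            torch.randn(groups, num_joints, num_joints) * 0.01)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V) -> split channels into groups, fuse joints per group
        x = self.proj(x)
        n, c, t, v = x.shape
        x = x.view(n, self.groups, c // self.groups, t, v)
        x = torch.einsum('ngctv,gvw->ngctw', x, self.coeff)
        return x.reshape(n, c, t, v)
```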
Besides GCNs, another stream of work adopts convolutional neural networks (CNNs) for skeleton-based action recognition. 2D-CNN-based approaches convert a skeleton sequence into a pseudo image and process it with a 2D CNN. The pseudo image can be obtained by: 1) directly arranging joint coordinates into a 2D input of shape V×T [2,22,25] (V: number of joints; T: temporal length; see the sketch after this list); or 2) generating pseudo heatmaps for the joints (2D only) and aggregating them along the temporal dimension with color coding [8] or learnable modules [52] to form a K-channel pseudo image.
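A minimal sketch of strategy 1) follows (function name and normalization are our illustration, not from [2,22,25]):

```python
import numpy as np

def skeleton_to_pseudo_image(joints):
    """Map a skeleton sequence of shape (T, V, C) -- T frames, V joints,
    C-dim coordinates -- to a pseudo image of shape (C, V, T)."""
    img = np.transpose(joints, (2, 1, 0))  # coordinates become channels
    # normalize each coordinate channel to [0, 1] so it behaves like pixels
    lo = img.min(axis=(1, 2), keepdims=True)
    hi = img.max(axis=(1, 2), keepdims=True)
    return (img - lo) / (hi - lo + 1e-6)
```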
These approaches either fail to exploit the locality of CNNs or suffer from information loss during heatmap aggregation, and thus perform worse than representative GCN-based approaches. Recently, PoseC3D [12]
proposes to stack heatmaps along the temporal dimension and use 3D-CNN for