Temporal Feature Alignment in Contrastive Self-Supervised Learning for
Human Activity Recognition
Bulat Khaertdinov and Stylianos Asteriadis
Department of Advanced Computing Sciences, Maastricht University
Maastricht, Netherlands
{b.khaertdinov, stelios.asteriadis}@maastrichtuniversity.nl
Abstract
Automated Human Activity Recognition has long been
a problem of great interest in human-centered and ubiquitous computing. In recent years, a plethora of supervised learning algorithms based on deep neural networks have been suggested to address this problem using various modalities. While every modality has its own limitations, there is one common challenge: supervised learning requires vast amounts of annotated data, which are hard to collect in practice. In this paper, we benefit from the
self-supervised learning paradigm (SSL) that is typically
used to learn deep feature representations from unlabeled
data. Moreover, we upgrade a contrastive SSL framework,
namely SimCLR, widely used in various applications by in-
troducing a temporal feature alignment procedure for Hu-
man Activity Recognition. Specifically, we propose integrat-
ing a dynamic time warping (DTW) algorithm in a latent
space to force features to be aligned in a temporal dimen-
sion. Extensive experiments have been conducted for the
unimodal scenario with inertial modality as well as in mul-
timodal settings using inertial and skeleton data. According
to the obtained results, the proposed approach shows great potential for learning robust feature representations compared to recent SSL baselines, and clearly outperforms supervised models in semi-supervised learning. The code
for the unimodal case is available via the following link:
https://github.com/bulatkh/csshar_tfa.
1. Introduction
Movements and activities of humans provide crucial in-
formation that can be used to understand their habits and
motives, monitor physical and mental health, analyze their
performance in specific tasks as well as assist their daily
life. The task of recognizing activities of people using data
describing their movement is generally known as Human
Activity Recognition (HAR). HAR is a problem that has
been extensively addressed in many areas, such as smart
homes [32], health monitoring [29] and manufacturing au-
tomation [10].
Human Activity Recognition can be tackled using different sources of input data, such as wearable devices, RGB-D streams, skeletal joints and others. While methods based on individual modalities have their own weaknesses, which can be mitigated by multimodal approaches, there is a significant practical challenge that has to be addressed: data annotation is an expensive and time-consuming process, especially for multimodal approaches. What is more, recent supervised models are trained on large amounts of annotated data.
Recent advances in representation learning have given
rise to a new family of techniques, namely, Self-Supervised
Learning (SSL). Through an SSL strategy, the goal is to
build data representations in the absence of annotated sam-
ples by solving an auxiliary pre-text task. Such represen-
tations are subsequently utilized as inputs in order to train
small-scale classifiers during the fine-tuning phase, neces-
sitating only limited amounts of annotated data.
In recent years, contrastive SSL frameworks have drawn the attention of researchers by demonstrating state-of-the-art performance in various application areas [4, 5]. The
main idea of contrastive learning is to train a feature en-
coder to group semantically similar, or positive, data points
together in a latent space, and push away the negative ones.
When data annotations are not available, contrastive learn-
ing uses two different representations of the same input in-
stance as a positive pair. Different representations of data
could be obtained using augmentations in the unimodal
case, while in the multimodal case these representations
could be derived from different modalities [34].
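The unimodal positive-pair idea described above is commonly implemented with the NT-Xent loss used by SimCLR [4]. The following is a minimal NumPy sketch of that loss for illustration only; the function name and the use of NumPy instead of an autodiff framework are our own choices, not this paper's implementation.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z_a, z_b: (N, D) embeddings of two augmented views of the same
    N instances; row i of each matrix forms a positive pair, and all
    other rows in the 2N-sample batch act as negatives.
    """
    z = np.concatenate([z_a, z_b], axis=0)            # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z_a)
    # index of the positive partner for each of the 2N anchors
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy of the positive against all other samples
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Near-identical views of the same instances yield a much lower loss than unrelated views, which is exactly the grouping behavior the encoder is trained toward.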
In this paper, inspired by [15], we aim to upgrade contrastive learning frameworks and exploit the temporal nature of the HAR task by introducing a Temporal Feature Alignment (TFA) algorithm that can be integrated into these frameworks. Specifically, the contributions of this paper can be summarized as follows:
arXiv:2210.03382v1 [cs.CV] 7 Oct 2022
We propose to integrate a differentiable version of the
Dynamic Time Warping (DTW) algorithm into con-
trastive learning frameworks applied to Human Activ-
ity Recognition to force alignment of features along
the temporal dimension.
The proposed method is applicable to both unimodal and multimodal contrastive learning problems. In particular, we integrate it into the SimCLR [4] and Contrastive Multiview Coding (CMC) [34] frameworks.
Extensive experiments have been conducted on uni-
modal sensor-based and multimodal (inertial and
skeleton) HAR datasets. The obtained results have
shown that the proposed method improves feature representation learning compared to recent SSL baselines and, most importantly, compared to SimCLR and CMC, in multiple evaluation scenarios.
2. Related Work
2.1. Unimodal and Multimodal HAR
Most of the algorithms applied to sensor data obtained using IMUs address HAR as a multivariate time-series classification task. Hence, the deep learning methods used for this problem include such architectures as
1D-CNNs [38], RNNs [11, 42] and their combinations [28].
Moreover, attention mechanisms have been exploited in var-
ious forms, such as sensor attention, temporal attention and
transformer self-attention [40, 26, 20]. Feature encoders for
skeleton modality are typically based on either 2D-CNNs,
RNNs or Graph Neural Networks [24, 33, 37, 6]. In this paper, we use a transformer-like architecture to encode inertial data and an adaptation of the so-called co-occurrence feature learning architecture from [24] for the skeleton modality.
Multimodal HAR approaches apply modality fusion at varying levels, such as on the raw data, on unimodal weak decisions, or on feature representations [21, 27, 8]. Other
recent works propose more sophisticated end-to-end archi-
tectures that are crafted specifically for multimodal HAR.
These approaches make use of recent advances in deep
learning, such as GANs to generate feature representations
of one modality given features from another [36], various
attention-based mechanisms to fuse different modalities to-
gether [17, 18], or knowledge distillation techniques [25].
In this paper, we make use of simple feature-level fusion during fine-tuning in order to fairly assess the quality of feature representations learnt by individual encoders in an SSL manner.
2.2. Contrastive Self-supervised Learning
In recent years, contrastive learning methods have shown impressive performance in various applications, including unimodal activity recognition [14, 19], by narrowing the gap between supervised and self-supervised methods. The main idea of this family of SSL methods is similar to metric learning: encoders are trained to group semantically similar, or positive, examples. When no labels are available, positive examples are formed by crafting two different views of each instance. Moreover, some approaches use negative pairs to avoid trivial collapsing solutions [34, 4, 16], while others propose not using them, introducing various schemes to prevent trivial shortcuts [9, 5]. A recent study on video
understanding by Haresh and Kumar et al. [15] proposed
to align semantically similar video frames in time using the
soft Dynamic Time Warping algorithm [7] and additional
regularization that prevents their encoders from collapsing.
Inspired by this idea, in this paper, we propose to exploit the temporal nature of the data used for HAR and attempt to align features along the temporal dimension by integrating a soft version of DTW into contrastive learning frameworks used in unimodal and multimodal settings, namely SimCLR [4] and CMC [34].
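As a concrete illustration of what such an alignment objective computes, the sketch below implements the soft-DTW recursion of [7] over a pairwise distance matrix between two feature sequences. This is a minimal NumPy version under our own naming; the actual integration into SimCLR and CMC described later in the paper may differ.

```python
import numpy as np

def soft_dtw(D, gamma=1.0):
    """Soft-DTW alignment cost between two feature sequences,
    given their pairwise distance matrix D of shape (n, m).

    gamma > 0 replaces the hard min over DTW paths with a smooth
    soft-min, making the cost differentiable; gamma -> 0 recovers
    classic Dynamic Time Warping.
    """
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)  # accumulated-cost table
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # soft-min over the three DTW predecessors (match, delete, insert)
            r = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            softmin = -gamma * np.logaddexp.reduce(-r / gamma)
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

For example, a sequence and a time-warped copy of it (one step repeated) yield a near-zero cost for small `gamma`, while temporally misaligned sequences are penalized; used as a loss term, this pushes the encoder to produce features that align well in time.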
3. Methodology
3.1. Problem Definition
Inertial signals, used in sensor-based HAR, are nor-
mally obtained from wearable sensors such as accelerom-
eters, gyroscopes and others. Thus, sensor-based HAR can
be considered a multivariate time-series classification task.
Specifically, at timestamp $t$, the input signal is defined as $\mathbf{x}_t = [x_t^1, x_t^2, \ldots, x_t^S] \in \mathbb{R}^S$, where $S$ is the number of channels. Hence, a time-window with signals aggregated over $T$ timesteps can be written as $X_i = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T]$. Finally, the goal is to associate each time-window with a correct output label $y \in Y$.
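To make this notation concrete, the following NumPy sketch segments a multivariate sensor stream into overlapping time-windows $X_i$ of length $T$; the function name, step size, and channel count are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def sliding_windows(signal, T, step):
    """Segment a multivariate signal of shape (num_timesteps, S)
    into overlapping windows X_i, each of shape (T, S)."""
    starts = range(0, len(signal) - T + 1, step)
    return np.stack([signal[s:s + T] for s in starts])

# e.g. a 6-channel IMU stream (3-axis accelerometer + 3-axis gyroscope)
stream = np.random.randn(500, 6)
windows = sliding_windows(stream, T=128, step=64)
print(windows.shape)  # (6, 128, 6): 6 windows of 128 timesteps, 6 channels
```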
Similarly, in this paper we address multimodal HAR using inertial sensors $X_i$ and 2D skeletal joints $X_s$. Skeleton data is generally represented as a set of coordinates tracked over time for a number of keypoints on a body. For any skeleton sequence, we denote by $T$ the number of frames in the sequence, by $J$ the number of joints, and by $C = 2$ the number of data channels, i.e., the dimensionality of the coordinates. Then, a skeleton sequence $X_s \in \mathbb{R}^{T \times J \times C}$ consists of $T$ frames, where the skeleton data for each frame is described by $P_t = [\mathbf{p}_t^1, \mathbf{p}_t^2, \ldots, \mathbf{p}_t^J]$ and $\mathbf{p}_t^j \in \mathbb{R}^C$ is the position of joint $j$ at frame $t$.
3.2. Contrastive SSL for HAR
The vast majority of self-supervised learning frame-
works are divided into two stages, namely pre-training and
fine-tuning. The aim of pre-training, also referred to as a
pre-text task, is to train a feature encoder on an auxiliary
task derived from unlabeled data. In contrastive learning