• We propose to integrate a differentiable version of the Dynamic Time Warping (DTW) algorithm into contrastive learning frameworks applied to Human Activity Recognition (HAR) to enforce alignment of features along the temporal dimension.
• The proposed method is applicable to both unimodal and multimodal contrastive learning problems. In particular, we integrate it into the SimCLR [4] and Contrastive Multiview Coding (CMC) [34] frameworks.
• Extensive experiments have been conducted on unimodal sensor-based and multimodal (inertial and skeleton) HAR datasets. The obtained results show that the proposed method improves feature representation learning compared with recent SSL baselines and, most importantly, with SimCLR and CMC, in multiple evaluation scenarios.
2. Related Work
2.1. Unimodal and Multimodal HAR
Most of the algorithms applied to sensor data obtained using IMUs address HAR as a multivariate time-series classification task. Hence, the deep learning methods applied to the problem include architectures such as 1D-CNNs [38], RNNs [11, 42] and their combinations [28].
Moreover, attention mechanisms have been exploited in var-
ious forms, such as sensor attention, temporal attention and
transformer self-attention [40, 26, 20]. Feature encoders for
skeleton modality are typically based on either 2D-CNNs,
RNNs or Graph Neural Networks [24, 33, 37, 6]. In this pa-
per we use a transformer-like architecture to encode inertial
data and an adaptation of a so-called co-occurrence feature
learning architecture from [24] for skeleton modality.
Multimodal HAR approaches apply modality fusion at varying levels, such as on the raw data, on unimodal weak decisions, or on feature representations [21, 27, 8]. Other
recent works propose more sophisticated end-to-end archi-
tectures that are crafted specifically for multimodal HAR.
These approaches make use of recent advances in deep
learning, such as GANs to generate feature representations
of one modality given features from another [36], various
attention-based mechanisms to fuse different modalities to-
gether [17, 18], or knowledge distillation techniques [25].
In this paper, we make use of simple feature-level fusion during fine-tuning in order to fairly assess the quality of feature representations learnt by individual encoders in an SSL manner.
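Feature-level fusion of this kind can be sketched as follows; the embedding size (128), batch size, and class count are hypothetical placeholders for illustration, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(f_inertial, f_skeleton):
    """Feature-level (late) fusion: concatenate the frozen per-modality
    embeddings along the feature axis."""
    return np.concatenate([f_inertial, f_skeleton], axis=-1)

# Hypothetical dimensions: a batch of 4 samples, 128-d features per encoder.
f_i = rng.standard_normal((4, 128))   # inertial encoder output
f_s = rng.standard_normal((4, 128))   # skeleton encoder output
fused = fuse(f_i, f_s)                # shape (4, 256)

# A linear classification head, the part trained during fine-tuning.
num_classes = 10
W = rng.standard_normal((256, num_classes))
logits = fused @ W                    # shape (4, num_classes)
```

Only the fused classification head (and, optionally, the encoders) is updated during fine-tuning, so the fused logits directly reflect the quality of each encoder's self-supervised features.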
2.2. Contrastive Self-supervised Learning
In recent years, contrastive learning methods have shown impressive performance in various applications, including unimodal activity recognition [14, 19], by narrowing the gap between supervised and self-supervised methods. The main idea of this family of SSL methods is similar to metric learning: encoders are trained to group semantically similar (positive) examples. When no labels are available, positive examples are formed by crafting two different views of each instance. Moreover, some approaches use negative pairs to avoid trivial collapsing solutions [34, 4, 16], while others dispense with them by introducing various schemes to prevent such trivial shortcuts [9, 5]. A recent study on video
understanding by Haresh and Kumar et al. [15] proposed
to align semantically similar video frames in time using the
soft Dynamic Time Warping algorithm [7] and additional
regularization that prevents their encoders from collapsing.
Inspired by this idea, in this paper we propose to exploit the temporal nature of the data used for HAR and align features along the temporal dimension by integrating a soft version of DTW into contrastive learning frameworks used in unimodal and multimodal settings, namely SimCLR [4] and CMC [34].
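The soft version of DTW [7] replaces the hard minimum in the classical DTW recursion with a differentiable soft minimum, $\mathrm{softmin}_\gamma(a_1,\dots,a_k) = -\gamma \log \sum_i e^{-a_i/\gamma}$. A minimal didactic sketch of the forward recursion (not the exact implementation used by [15] or in this paper) might look like:

```python
import numpy as np

def soft_dtw(D, gamma=1.0):
    """Soft-DTW value for a pairwise cost matrix D of shape (n, m);
    gamma > 0 controls the smoothing (hard DTW as gamma -> 0).
    Didactic O(n*m) dynamic program, not an optimized implementation."""
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # soft minimum over the three DTW predecessors
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            softmin = -gamma * np.log(np.sum(np.exp(-prev / gamma)))
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

Because every operation is differentiable, the alignment cost can be backpropagated through the encoders, which is what makes it usable as a contrastive training signal. A numerically stable implementation would compute the soft minimum with a log-sum-exp trick rather than raw exponentials.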
3. Methodology
3.1. Problem Definition
Inertial signals, used in sensor-based HAR, are nor-
mally obtained from wearable sensors such as accelerom-
eters, gyroscopes and others. Thus, sensor-based HAR can
be considered a multivariate time-series classification task.
Specifically, at timestamp $t$, the input signal is defined as $\mathbf{x}_t = [x_t^1, x_t^2, \dots, x_t^S] \in \mathbb{R}^S$, where $S$ is the number of channels. Hence, a time window with signals aggregated over $T$ timesteps can be written as $X^i = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T]$. Finally, the goal is to associate each time window with a correct output label $y \in Y$.
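This windowing step can be illustrated as follows; the stream length, channel count $S = 6$ and window parameters are hypothetical values chosen for the example:

```python
import numpy as np

def make_windows(signal, T, stride):
    """Segment a continuous multivariate stream of shape (N, S) into
    overlapping time windows X_i, each of shape (T, S)."""
    N = signal.shape[0]
    starts = range(0, N - T + 1, stride)
    return np.stack([signal[s:s + T] for s in starts])

# e.g. S = 6 channels (3-axis accelerometer + 3-axis gyroscope),
# windows of T = 100 timesteps with 50% overlap
stream = np.zeros((1000, 6))
windows = make_windows(stream, T=100, stride=50)   # shape (19, 100, 6)
```

Each of the resulting windows $X^i$ is then treated as one classification instance.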
Similarly, in this paper we address multimodal HAR using inertial sensors $X^i$ and 2D skeletal joints $X^s$. Skeleton data is generally represented as a set of coordinates tracked over time for a number of keypoints on a body. For any skeleton sequence, we denote $T$ as the number of frames in the sequence, $J$ as the number of joints and $C = 2$ as the number of data channels, i.e., the dimensionality of the coordinates. Then, a skeleton sequence $X^s \in \mathbb{R}^{T \times J \times C}$ consists of $T$ frames, where the skeleton data for each frame is described by $P_t = [\mathbf{p}_t^1, \mathbf{p}_t^2, \dots, \mathbf{p}_t^J]$, and $\mathbf{p}_t^j \in \mathbb{R}^C$ is the position of joint $j$ at frame $t$.
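The resulting tensor layout can be illustrated as follows; $J = 17$ is a hypothetical keypoint count (e.g. a COCO-style skeleton), not necessarily the value used by the datasets in this paper:

```python
import numpy as np

# A skeleton sequence X^s with T frames, J joints and C = 2 coordinate
# channels, stored as a tensor of shape (T, J, C).
T, J, C = 100, 17, 2
X_s = np.zeros((T, J, C))

P_t = X_s[4]       # P_t: all joint positions at frame t = 4, shape (J, C)
p_tj = X_s[4, 9]   # p_t^j: position of joint j = 9 at frame t = 4, shape (C,)
```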
3.2. Contrastive SSL for HAR
The vast majority of self-supervised learning frameworks are divided into two stages: pre-training and fine-tuning. The aim of pre-training, also referred to as a pretext task, is to train a feature encoder on an auxiliary task derived from unlabeled data. In contrastive learning