• We propose to integrate a differentiable version of the Dynamic Time Warping (DTW) algorithm into contrastive learning frameworks applied to Human Activity Recognition (HAR) to enforce alignment of features along the temporal dimension.
• The proposed method is applicable to both unimodal and multimodal contrastive learning problems. In particular, we integrate it into the SimCLR [4] and Contrastive Multiview Coding (CMC) [34] frameworks.
• Extensive experiments have been conducted on unimodal sensor-based and multimodal (inertial and skeleton) HAR datasets. The obtained results show that the proposed method improves feature representation learning compared with recent SSL baselines and, most importantly, with SimCLR and CMC, in multiple evaluation scenarios.
2. Related Work
2.1. Unimodal and Multimodal HAR
Most of the algorithms applied to sensor data obtained using IMUs address HAR as a multivariate time-series classification task. Hence, the deep learning methods applied to the problem include architectures such as 1D-CNNs [38], RNNs [11, 42] and their combinations [28].
Moreover, attention mechanisms have been exploited in var-
ious forms, such as sensor attention, temporal attention and
transformer self-attention [40, 26, 20]. Feature encoders for
skeleton modality are typically based on either 2D-CNNs,
RNNs or Graph Neural Networks [24, 33, 37, 6]. In this pa-
per we use a transformer-like architecture to encode inertial
data and an adaptation of a so-called co-occurrence feature
learning architecture from [24] for skeleton modality.
Multimodal HAR approaches apply modality fusion at varying levels, such as on the raw data, on unimodal weak decisions, or on feature representations [21, 27, 8]. Other
recent works propose more sophisticated end-to-end archi-
tectures that are crafted specifically for multimodal HAR.
These approaches make use of recent advances in deep
learning, such as GANs to generate feature representations
of one modality given features from another [36], various
attention-based mechanisms to fuse different modalities to-
gether [17, 18], or knowledge distillation techniques [25].
In this paper, we make use of simple feature-level fusion during fine-tuning in order to fairly assess the quality of feature representations learnt by individual encoders in an SSL manner.
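Feature-level fusion of this kind can be sketched as follows; the embedding size (128), batch size, and class count are hypothetical placeholders for illustration, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(f_inertial, f_skeleton):
    """Feature-level (late) fusion: concatenate the frozen per-modality
    embeddings along the feature axis."""
    return np.concatenate([f_inertial, f_skeleton], axis=-1)

# Hypothetical dimensions: a batch of 4 samples, 128-d features per encoder.
f_i = rng.standard_normal((4, 128))   # inertial encoder output
f_s = rng.standard_normal((4, 128))   # skeleton encoder output
fused = fuse(f_i, f_s)                # shape (4, 256)

# A linear classification head, the part trained during fine-tuning.
num_classes = 10
W = rng.standard_normal((256, num_classes))
logits = fused @ W                    # shape (4, num_classes)
```

Only the fused classification head (and, optionally, the encoders) is updated during fine-tuning, so the fused logits directly reflect the quality of each encoder's self-supervised features.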
2.2. Contrastive Self-supervised Learning
In recent years, contrastive learning methods have shown impressive performance in various applications, including unimodal activity recognition [14, 19], by narrowing the gap between supervised and self-supervised methods. The main idea of this family of SSL methods is similar to metric learning: encoders are trained to group semantically similar (positive) examples. When no labels are available, positive examples are formed by crafting two different views of each instance. Moreover, some approaches use negative pairs to avoid trivial collapsing solutions [34, 4, 16], while others dispense with them by introducing various schemes to prevent such trivial shortcuts [9, 5]. A recent study on video
understanding by Haresh and Kumar et al. [15] proposed
to align semantically similar video frames in time using the
soft Dynamic Time Warping algorithm [7] and additional
regularization that prevents their encoders from collapsing.
Inspired by this idea, in this paper we propose to exploit the temporal nature of the data used for HAR and align features along the temporal dimension by integrating a soft version of DTW into contrastive learning frameworks used in unimodal and multimodal settings, namely SimCLR [4] and CMC [34].
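The soft version of DTW [7] replaces the hard minimum in the classical DTW recursion with a differentiable soft minimum, $\mathrm{softmin}_\gamma(a_1,\dots,a_k) = -\gamma \log \sum_i e^{-a_i/\gamma}$. A minimal didactic sketch of the forward recursion (not the exact implementation used by [15] or in this paper) might look like:

```python
import numpy as np

def soft_dtw(D, gamma=1.0):
    """Soft-DTW value for a pairwise cost matrix D of shape (n, m);
    gamma > 0 controls the smoothing (hard DTW as gamma -> 0).
    Didactic O(n*m) dynamic program, not an optimized implementation."""
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # soft minimum over the three DTW predecessors
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            softmin = -gamma * np.log(np.sum(np.exp(-prev / gamma)))
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

Because every operation is differentiable, the alignment cost can be backpropagated through the encoders, which is what makes it usable as a contrastive training signal. A numerically stable implementation would compute the soft minimum with a log-sum-exp trick rather than raw exponentials.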
3. Methodology
3.1. Problem Definition
Inertial signals, used in sensor-based HAR, are nor-
mally obtained from wearable sensors such as accelerom-
eters, gyroscopes and others. Thus, sensor-based HAR can
be considered a multivariate time-series classification task.
Specifically, at timestamp $t$, the input signal is defined as $\mathbf{x}_t = [x_t^1, x_t^2, \dots, x_t^S] \in \mathbb{R}^S$, where $S$ is the number of channels. Hence, a time window with signals aggregated over $T$ timesteps can be written as $X^i = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T]$. Finally, the goal is to associate each time window with a correct output label $y \in Y$.
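This windowing step can be illustrated as follows; the stream length, channel count $S = 6$ and window parameters are hypothetical values chosen for the example:

```python
import numpy as np

def make_windows(signal, T, stride):
    """Segment a continuous multivariate stream of shape (N, S) into
    overlapping time windows X_i, each of shape (T, S)."""
    N = signal.shape[0]
    starts = range(0, N - T + 1, stride)
    return np.stack([signal[s:s + T] for s in starts])

# e.g. S = 6 channels (3-axis accelerometer + 3-axis gyroscope),
# windows of T = 100 timesteps with 50% overlap
stream = np.zeros((1000, 6))
windows = make_windows(stream, T=100, stride=50)   # shape (19, 100, 6)
```

Each of the resulting windows $X^i$ is then treated as one classification instance.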
Similarly, in this paper we address multimodal HAR using inertial sensors $X^i$ and 2D skeletal joints $X^s$. Skeleton data is generally represented as a set of coordinates tracked over time for a number of keypoints on a body. For any skeleton sequence, we denote $T$ as the number of frames in the sequence, $J$ as the number of joints and $C = 2$ as the number of data channels, i.e., the dimensionality of the coordinates. Then, a skeleton sequence $X^s \in \mathbb{R}^{T \times J \times C}$ consists of $T$ frames, where the skeleton data for each frame is described by $P_t = [\mathbf{p}_t^1, \mathbf{p}_t^2, \dots, \mathbf{p}_t^J]$, and $\mathbf{p}_t^j \in \mathbb{R}^C$ is the position of joint $j$ at frame $t$.
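The resulting tensor layout can be illustrated as follows; $J = 17$ is a hypothetical keypoint count (e.g. a COCO-style skeleton), not necessarily the value used by the datasets in this paper:

```python
import numpy as np

# A skeleton sequence X^s with T frames, J joints and C = 2 coordinate
# channels, stored as a tensor of shape (T, J, C).
T, J, C = 100, 17, 2
X_s = np.zeros((T, J, C))

P_t = X_s[4]       # P_t: all joint positions at frame t = 4, shape (J, C)
p_tj = X_s[4, 9]   # p_t^j: position of joint j = 9 at frame t = 4, shape (C,)
```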
3.2. Contrastive SSL for HAR
The vast majority of self-supervised learning frameworks are divided into two stages: pre-training and fine-tuning. The aim of pre-training, also referred to as a pretext task, is to train a feature encoder on an auxiliary task derived from unlabeled data. In contrastive learning