
pairs. For example, SimCLR considered the two augmented views of a sample as a positive pair, while all other samples within the same mini-batch were considered negative pairs [9]. MoCo increased the number of negative pairs by keeping samples from preceding mini-batches in a memory bank [11]. On the other hand, some recent algorithms, such as BYOL [13] and SimSiam [12], neglected the negative pairs and relied on positive pairs only.
C. Self-Supervised Learning for Sleep Staging
The success of SSL in computer vision applications motivated its adoption for sleep stage classification. For example, Mohsenvand et al. [22] and Jiang et al. [23] proposed SimCLR-like methodologies and applied EEG-specific augmentations for sleep stage classification. Banville et al. applied three pretext tasks, i.e., relative positioning, temporal shuffling, and contrastive predictive coding (CPC), to explore the underlying structure of unlabeled sleep EEG data [24]. The CPC algorithm [10] predicts future timesteps in the time-series signal, which motivated other works to build on it. For example, SleepDPC solved two problems, i.e., predicting future representations of epochs and distinguishing epochs from other epochs [25]. TS-TCC proposed temporal and contextual contrasting approaches to learn instance-wise representations from sleep EEG data [8]. In addition, SSLAPP developed a contrastive learning approach with attention-based augmentations in the embedding space to add more positive pairs [26]. Last, CoSleep [14] and SleepECL [27] are two further contrastive methods that exploit additional information from EEG data, e.g., inter-epoch dependency and frequency-domain views, to obtain more positive pairs for contrastive learning.
III. EVALUATION FRAMEWORK
A. Preliminaries
In this section, we describe the SSL-related terminology, i.e., pretext tasks, contrastive learning, and downstream tasks.
1) Problem Formulation: We assume that the input is single-channel EEG data in $\mathbb{R}^d$, and that each sample has one label from one of $C$ classes. The supervised downstream task has access to the inputs and the corresponding labels, while the self-supervised learning algorithms have access only to the inputs.

The SSC networks consist of three main parts. The first is the feature extractor $f_\phi: \mathbb{R}^d \rightarrow \mathbb{R}^{m_1}$, parameterized by neural network parameters $\phi$, which maps the input data into the embedding space. The second is the temporal encoder (TE), an intermediate network that improves the temporal representations and may change the dimension of the embedded features, $f_\theta: \mathbb{R}^{m_1} \rightarrow \mathbb{R}^{m}$. The third is the classifier $f_\gamma: \mathbb{R}^{m} \rightarrow \mathbb{R}^{C}$, which produces the predictions. The SSL algorithms learn $\phi$ from unlabeled data, while fine-tuning learns $\theta$ and $\gamma$ and also updates $\phi$.
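To make the three-part structure concrete, the following PyTorch-style sketch composes a feature extractor $f_\phi$, a temporal encoder $f_\theta$, and a classifier $f_\gamma$. The layer choices, dimensions, and class name are illustrative assumptions, not the architectures of the models evaluated later.

```python
import torch.nn as nn

class SSCModel(nn.Module):
    """Minimal sketch of the three-part SSC network described above.

    Expects input of shape (batch, 1, d), e.g., one 30-s EEG epoch.
    """
    def __init__(self, m1=64, m=128, num_classes=5):
        super().__init__()
        # f_phi: feature extractor, R^d -> R^{m1}
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=25, stride=6), nn.ReLU(),
            nn.AdaptiveAvgPool1d(4), nn.Flatten(),
            nn.Linear(32 * 4, m1),
        )
        # f_theta: temporal encoder, R^{m1} -> R^m (placeholder; the evaluated
        # models use their own temporal modules, e.g., BiLSTM or attention)
        self.temporal_encoder = nn.Sequential(nn.Linear(m1, m), nn.ReLU())
        # f_gamma: classifier, R^m -> R^C
        self.classifier = nn.Linear(m, num_classes)

    def forward(self, x):                 # x: (batch, 1, d)
        h = self.feature_extractor(x)     # (batch, m1)
        h = self.temporal_encoder(h)      # (batch, m)
        return self.classifier(h)         # (batch, num_classes)
```

During self-supervised pretraining only feature_extractor ($\phi$) is optimized on unlabeled data; fine-tuning then trains temporal_encoder ($\theta$) and classifier ($\gamma$) while also updating $\phi$.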
2) Pretext tasks: Pretext tasks refer to pre-designed tasks that allow the model to learn generalized representations from unlabeled data. Here, we describe two main types of pretext tasks, i.e., auxiliary tasks and contrastive tasks.
a) Auxiliary tasks: This category involves defining a new task along with free-to-generate pseudo labels. These tasks can be formulated as classification, regression, or otherwise. In the context of time-series applications, a new classification auxiliary task was defined in [28], [29] by generating several views of the signals using augmentations such as adding noise, rotation, and scaling. Each view was assigned a label, and the model was pretrained to classify these transformations. This approach showed success in learning underlying representations from unlabeled data. However, it is usually designed with heuristics that might limit the generality of the learned representations [8].
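As a rough sketch of this idea (the augmentation set, encoder, and classification head below are hypothetical, not the exact setup of [28], [29]), each signal is transformed, the transformation index serves as a free pseudo label, and the model is pretrained to recognize which transformation was applied:

```python
import torch
import torch.nn.functional as F

# Hypothetical augmentation set; [28], [29] may use different transformations.
def add_noise(x, sigma=0.1):
    return x + sigma * torch.randn_like(x)

def scale(x, factor=1.5):
    return x * factor

def negate(x):  # simple stand-in for "rotation" on a 1-D signal
    return -x

TRANSFORMS = [lambda x: x, add_noise, scale, negate]

def make_pretext_batch(x):
    """Apply each transformation and use its index as a free pseudo label."""
    views, labels = [], []
    for label, transform in enumerate(TRANSFORMS):
        views.append(transform(x))
        labels.append(torch.full((x.size(0),), label, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)

def auxiliary_pretrain_step(encoder, head, x, optimizer):
    """One pretraining step: classify which transformation was applied."""
    views, pseudo_labels = make_pretext_batch(x)
    loss = F.cross_entropy(head(encoder(views)), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```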
b) Contrastive learning: In contrastive learning, representations are learned by comparing the similarity between samples. Specifically, we define positive and negative pairs for each sample. Next, the feature extractor is trained to achieve the contrastive objective, i.e., pull the features of the sample towards its positive pairs and push them away from its negative pairs. These pairs are usually generated via data augmentations. Notably, some studies [9], [30] relied on strong successive augmentations and found them to be a key factor in the success of their contrastive techniques.
Formally, given a dataset with $N$ unlabeled samples, we generate two views of each sample $x$, i.e., $\{\hat{x}_i, \hat{x}_j\}$, using data augmentations. Therefore, in a multiviewed batch with $N$ samples per view, we have a total of $2N$ samples. Next, the feature extractor transforms them into the embedding space, and a projection head $h(\cdot)$ is used to obtain low-dimensional embeddings, i.e., $z_i = h(f_\phi(\hat{x}_i))$ and $z_j = h(f_\phi(\hat{x}_j))$. For an anchor sample indexed $i \in I \equiv \{1, \ldots, 2N\}$, let $j$ be the index of its positive pair and let $A(i) \equiv I \setminus \{i\}$. The objective of contrastive learning is to encourage the similarity between positive pairs and to separate the negative pairs apart using the NT-Xent loss, defined as follows:
$$\mathcal{L}_{\text{NT-Xent}} = -\frac{1}{2N} \sum_{i \in I} \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}, \qquad (1)$$
where the $\cdot$ symbol denotes the inner (dot) product, and $\tau$ is a temperature parameter.
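A minimal sketch of the NT-Xent loss in Eq. (1), assuming the projections are L2-normalized (as in SimCLR [9]) and the two views of the $N$ samples are stacked so that rows $k$ and $k+N$ form a positive pair:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over a multiviewed batch; z1, z2 are (N, dim) projections of the two views."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, dim)
    sim = z @ z.t() / temperature                       # pairwise z_i . z_a / tau
    # Exclude self-similarity so the denominator runs over A(i) = I \ {i}.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))
    # Index of the positive pair for each anchor: k <-> k + N.
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # -log softmax selects exp(z_i . z_j / tau) over the sum across A(i).
    loss = -F.log_softmax(sim, dim=1)[torch.arange(2 * n, device=z.device), pos]
    return loss.mean()  # mean over all 2N anchors gives the 1/(2N) factor
```

Here z1[k] and z2[k] would be the projected embeddings $z_i$ and $z_j$ of the two augmented views of the same sample.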
3) Downstream tasks: Downstream tasks are the main tasks of interest, which lack a sufficient amount of labeled data for training deep learning models. In this paper, the downstream task is sleep stage classification, i.e., classifying the PSG epochs into one of five classes: W, N1, N2, N3, and REM. In general, however, the downstream task can differ and is defined by the application. Notably, different pretext tasks can have a different impact on the same downstream task. Therefore, it is important to design a pretext task relevant to the problem of interest in order to learn better representations. Despite the numerous methods proposed for self-supervised learning, identifying the proper pretext task is still an open research question [31].
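For illustration, a hedged sketch of the downstream fine-tuning stage, assuming a model following the SSCModel sketch above (with its feature extractor already pretrained) and a loader yielding labeled 30-s epochs with stages encoded as 0-4; all names are illustrative:

```python
import torch
import torch.nn as nn

def finetune(model, labeled_loader, epochs=10, lr=1e-4):
    """Fine-tune a pretrained SSC model on labeled sleep-staging data (W, N1, N2, N3, REM)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # updates phi, theta, and gamma
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in labeled_loader:  # y in {0, ..., 4}, one label per 30-s epoch
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```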
B. Sleep Stage Classification Models
We perform our experiments on three sleep stage classification models, i.e., DeepSleepNet [6], AttnSleep [7], and